Twitter has a fundamental problem: It’s broken. Worse, the company seems unable to understand how to fix itself, and its users increasingly recognize the problem and are getting frustrated at the lack of any solution.

As someone who does community management for a living and uses Twitter as my primary social network, I see the problems on a daily basis. I see friends who have cut back their usage or given up entirely, and I find myself constantly self-editing my use of the service to stay away from topics I know are more likely to bring out the trolls, because Twitter simply doesn’t have the tools in place for me to protect myself if they arrive.

I shouldn’t have to self-edit to protect myself from attacks, but that’s the state of Twitter today.

How do you fix Twitter? Can Twitter be fixed? I have had this conversation multiple times recently, and each one ended up turning into “it’s hard to explain, but….” That tells me it’s time to try to sort out the problems and explain how I’d tackle them if it were my problem to solve. Which, fortunately, it’s not.

Twitter is fundamentally broken

How do I define the way Twitter is broken?

A user of a social system should have an expectation that they can feel safe using it. If a user gets attacked on the service, they should be able to mute the attack. People who follow that user shouldn’t be attacked simply because they follow them. A user should be able to report attacks and abuse with an expectation that the service will evaluate the report and act as necessary to protect them.

Twitter fails at this on all counts. This failure goes back to 2014 and well before. It continues today, even though in February they announced their latest round of “hey, we’re getting serious about fixing the abuse problem.” Six months later, we’re still waiting for anything substantive.

It’s clear to me they don’t know how to fix it, and that management really isn’t committed to wanting it fixed. Here’s one problem: Twitter reports total account counts and Monthly Active Users (MAU) to the financial markets as its key measures of the company’s health.

How is that a problem? Based on various studies, as many as 15% of Twitter accounts, upwards of 48 million, are run by bots. Twitter says that number is only 8%, as if that’s a good thing. Because those bots are counted in the numbers Twitter uses to show financial analysts the health of the company, there is a significant disincentive for Twitter to get serious about shutting down bots and abusive accounts: doing so would shrink the reported numbers, and the analysts might hit the stock for it.

So my take is that Twitter really doesn’t want to clean up the abusive bots, because they’re about the only part of their user base that’s growing, and they won’t take the risk of a bad reaction by Wall Street if the numbers drop.

Instead? Upset the wrong person and they’ll fire up part of a botnet and inundate your Twitter account, and those of your followers, with a huge volume of abusive messages. There was some hope that the new tools released last January would help, but in the months since there really hasn’t been a significant improvement.

Abuse is rampant, and it’s driving an increasing number of the people who make Twitter useful into the shadows or off the service completely. We’ve seen how this play ends: it’s in many ways how USENET faded into irrelevance (and into warez and porn).

What to do? I’ve said more than once that I think we’re at the point where the best way to save Twitter from itself is for it to be bought by another company, with existing management tossed out and replaced by a team that is willing to get serious about solving the problems of the service.

Will it happen? I have no idea, but I hope so.

But in the meantime, I thought it would be interesting to discuss how I would approach the problem if I ever had to dig into it, and to look at both the challenges and the opportunities that exist.

The challenge of web-scale

Ultimately the problem at Twitter is a policy problem and a community management problem, which is why it’s been of interest to me. The first challenge of community management is that it doesn’t scale well. A community manager can successfully handle a small group, up to a few tens of thousands depending on the population, but as the group continues to grow, covering it well and consistently becomes a challenge.

Now, grow that problem from tens of thousands to tens or hundreds of millions. You literally couldn’t hire enough talent to cover a community that size the way you would a smaller one. YouTube has 300 hours of video uploaded to it every minute. Stop and imagine the scale of a team charged with reviewing and approving that content.

So you can’t hire your way out of the problem. You need technology. Technology pushes us in the other direction, though, where companies become overly reliant on algorithms to solve the problem. A good example of this kind of thinking is the recent complaint about Facebook, where it was found that people could target ads to groups like “Jew Hater”. Facebook’s answer? More human oversight. Where did the problem come from? Building a system on the assumption that the technology would prevent problems. Which it did, except that it can only prevent problems the humans knew to program it for, and this wasn’t one of them.

So the answer to solving these problems is to use technology to amplify and leverage a human component.

My tool of choice? A reputation system driven by a Machine Learning setup.

Reputation Systems

Fifteen years ago, we thought e-mail was going to die. The spam problem was overwhelming; our inboxes were exploding with spam. But technology allowed us to solve the problem. Systems known as Realtime Blackhole Lists (RBLs) were created, which tracked misbehaving or abusive email servers and could be queried by a receiving email server that wanted to know whether or not to accept the mail being offered.

The early versions of these RBLs were very manual so scaling was a problem, and some of them had political biases that made their data more or less reliable depending on your agreement with their policies. Over time, however, the RBLs figured out how to automate accepting and interpreting the data they got on spam email — and in many ways were the first reputation systems that were built and operated at scale.
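
To make this concrete, here is a minimal sketch in Python of the kind of lookup a receiving mail server performs against a DNS-based blackhole list. The convention of reversing the IP’s octets, prepending them to the list’s zone, and checking whether the name resolves is standard for DNSBLs; the zen.spamhaus.org zone is just one example, and the helper name is my own.

```python
import socket

def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """Check whether an IPv4 address appears in a DNS-based blackhole list.

    Reverse the octets of the IP, append the DNSBL zone, and do an A-record
    lookup. A successful resolution means "listed"; NXDOMAIN means "not listed".
    """
    reversed_ip = ".".join(reversed(ip.split(".")))
    try:
        socket.gethostbyname(f"{reversed_ip}.{zone}")  # resolves only if listed
        return True
    except socket.gaierror:
        return False

# A receiving server deciding whether to accept a connection:
if is_listed("192.0.2.1"):
    print("reject or greylist mail from this server")
else:
    print("accept mail from this server")
```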

And they work. I still get only one or two spam emails in my inbox in a month, and if I look in the spam folder there are a few hundred messages that were accepted and then filtered out, but effectively, my inbox, which has existed and been public on web pages since 1995, gets no spam.

Reputation systems are conceptually a simple beast: everything we want to track has an identity, and each identity has a reputation, which is a number. For email the identity is the IP address of the mail server; on Twitter it would be the account name.

The reputation starts at some default value, say, 1,000. And then based on activities taken by that identity and how the system interprets them, we increase or decrease that reputation. Over time, the reputation changes to a value that represents how constructive or destructive that identity is to the systems we’re tracking.

And we can make decisions based on that identity reputation. This allows us to modify the behavior of the system based on the identity we are dealing with at any given moment.

That’s the whole pattern: assign an identity to the things we want to manage, give each identity a reputation value, adjust that value up or down based on the actions the identity takes, and then use the value to decide how the system treats that identity.

Twitter actually needs two reputation systems: one tied to the identity of its users, and one tied to the links posted in tweets. The latter needs to go one level deeper, because the reputation should be built on the final content the link points to, so that all links that resolve to the same destination share the same reputation.
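
One way to key that deeper link reputation, sketched here under the assumption that we resolve redirects with the requests library before storing anything: follow the posted link to its final destination and use a lightly normalized form of that final URL as the reputation key. The canonical_link helper is hypothetical, not anything Twitter provides.

```python
import requests
from urllib.parse import urlsplit, urlunsplit

def canonical_link(url: str, timeout: float = 5.0) -> str:
    """Resolve a (possibly shortened) link to its final destination so that
    every alias of the same content shares one reputation record."""
    try:
        # Follow t.co / bit.ly style redirects to the final URL.
        final = requests.head(url, allow_redirects=True, timeout=timeout).url
    except requests.RequestException:
        final = url  # if resolution fails, fall back to the raw link
    # Light normalization: lowercase the host and drop any fragment.
    parts = urlsplit(final)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, parts.query, ""))
```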

So we start simple: a new Twitter user starts out with a reputation of 1,000. This user posts a tweet, which includes a link to something. If someone likes the tweet, the poster’s reputation goes up by a point (and so does the URL’s). If someone reports the tweet as a problem, the poster’s reputation goes down by a point.
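
Here is a toy sketch of that bookkeeping. The ReputationStore class and the event handlers are names I’ve invented purely for illustration; the only parts taken from the scheme above are the 1,000 starting value and the one-point adjustments.

```python
DEFAULT_REPUTATION = 1000

class ReputationStore:
    """Tracks a reputation number per identity (account name or canonical link)."""

    def __init__(self):
        self.scores = {}

    def get(self, identity: str) -> float:
        return self.scores.get(identity, DEFAULT_REPUTATION)

    def adjust(self, identity: str, delta: float) -> None:
        self.scores[identity] = self.get(identity) + delta

accounts = ReputationStore()
links = ReputationStore()

def on_like(author: str, url: str) -> None:
    accounts.adjust(author, +1)  # a like nudges the poster's reputation up
    links.adjust(url, +1)        # ...and the linked content's as well

def on_report(author: str, url: str) -> None:
    accounts.adjust(author, -1)  # a report nudges both down
    links.adjust(url, -1)
```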

Over time, if a user consistently contributes material that’s liked, their reputation increases. If they contribute material that’s reported more than liked, their reputation goes down. How valuable that user is to the community as a whole can be roughly defined by their reputation value.

But there’s a problem: reporting can be and is used as an abuse vector, and this is something most social sites absolutely suck at dealing with. Facebook is particularly bad at recognizing attacks on users through its reporting channels, and I constantly see people reporting that their accounts were locked because of this.

A quick digression on this challenge: back when I was working as Community Manager at Palm, I went to a meeting with a product manager to talk about proposed changes to the App Store. Her proposal was to add buttons for people to report apps that were abusive or contained inappropriate materials. Her plan was that if we got those reports, those apps would be pulled from the store for evaluation.

My first question to her was “How do you think this will work when developers start flagging their competitors to get them pulled from the store?” And her response was simply “They’d do that?”

That was, I think, the moment I realized I needed to leave Palm. And here’s an important hint for success: don’t let people who aren’t community users and managers design your communities. Bad things will happen.

By the way, we agreed to meet again and I told her I’d help her design a proper abuse reporting system. She never set up that meeting or talked to me again, which was typical at Palm. But fortunately, they never implemented what she was working on, either.

Using a weighted average on problem reports

Anyway, back to the problem at hand. Let’s start to deal with it by using an account’s reputation to judge how much weight to give the actions it takes. If an account with a reputation of 1,000 reports a tweet, that report counts as 1. If the account has a reputation of 1,500, it might count as 1.5 reports. If the reputation is 500 instead, that report might only count as half a report. One way to implement this is with a weighted average, producing a metric based on all of the actions taken on a tweet, weighted by the reputations of those taking them.

In practice, we don’t look at a single piece of information based on a single report, but we collect all of the reports — positive and negative — together using a weighted average, and we end up with a single value that gives us the relative quality of that thing we’re judging, based on all of the reports made on it. It is, effectively, the reputation of that thing, based on the reports of users taking action on it, balanced by the reputation of those users.
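
As a rough sketch, assuming a like is recorded as +1 and a report as -1, and that a reputation of 1,000 counts as a weight of 1.0, the calculation might look like this:

```python
DEFAULT_REPUTATION = 1000

def tweet_score(actions):
    """Combine likes (+1) and reports (-1) on a tweet, weighting each action
    by the acting account's reputation relative to the default.

    `actions` is an iterable of (action_value, actor_reputation) pairs, e.g.
    (+1, 1500) for a like from a high-reputation account or (-1, 500) for
    a report from a low-reputation one.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for value, reputation in actions:
        weight = reputation / DEFAULT_REPUTATION  # 1,000 of reputation counts as 1.0
        weighted_sum += value * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

# Two reports from trusted accounts outweigh three likes from throwaways.
print(tweet_score([(-1, 1500), (-1, 1200), (+1, 300), (+1, 300), (+1, 300)]))  # -0.5
```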

This allows us to do some interesting things. The queue of things reported as problems will always be much larger than the staff evaluating those reports can review, but now we can take the list of reported items and sort it by the calculated score, giving us a ranked list of the items generating the strongest reactions, negative and positive, to be evaluated for disciplinary action or positive recognition respectively. This basic reputation data amplifies your abuse team, steering them toward the hottest issues as they appear and allowing for faster and more reliable responses to those problems.
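
Given such scores, the triage itself is just a sort. The sample data below is invented purely to show the shape of the queue:

```python
# Hypothetical reported tweets with their precomputed weighted scores
# (most negative = most strongly reported by high-reputation accounts).
reported = [
    {"tweet_id": 101, "score": -0.5},
    {"tweet_id": 102, "score": -3.2},
    {"tweet_id": 103, "score": 0.4},  # reported, but mostly liked
]

# Review the items drawing the strongest negative reaction first...
review_queue = sorted(reported, key=lambda item: item["score"])
# ...and surface the most positively received ones for recognition.
recognition_queue = sorted(reported, key=lambda item: item["score"], reverse=True)

for item in review_queue[:50]:
    print(item["tweet_id"], item["score"])
```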

This also brings the abuse team’s actions into play for modifying reputations. Let’s say they review a tweet and decide it’s abusive, so they flag it for deletion. That action removes it from the system, but it should do more. Other things we should do based on this administrative action include the following (sketched in code after the list):

  • Bump the reputation of the user posting the abusive tweet down by a large number, say 250.
  • Bump the reputation of those reporting that tweet up by some number, say 100.
  • Bump the reputation of those liking the abusive tweet down by some number, say 100.
  • Set the reputation of the included link to negative infinity.
  • Bump the reputation of users who retweeted the tweet down by a smaller number, say 50.
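
Pulling those adjustments together, here is a sketch using plain dictionaries and the example deltas above; the on_tweet_removed handler and the tweet record’s field names are assumptions of mine:

```python
from collections import defaultdict

DEFAULT_REPUTATION = 1000
account_rep = defaultdict(lambda: DEFAULT_REPUTATION)  # account name -> reputation
link_rep = defaultdict(lambda: DEFAULT_REPUTATION)     # canonical link -> reputation

def on_tweet_removed(tweet):
    """Follow-on adjustments once the abuse team confirms a tweet is abusive."""
    account_rep[tweet["author"]] -= 250          # large hit for the poster
    for account in tweet["reporters"]:
        account_rep[account] += 100              # reward the upheld reports
    for account in tweet["likers"]:
        account_rep[account] -= 100              # penalize endorsing the abuse
    for account in tweet["retweeters"]:
        account_rep[account] -= 50               # smaller hit for amplifying it
    if tweet.get("link"):
        link_rep[tweet["link"]] = float("-inf")  # never trust this link again
```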

The retweet penalty is itself a problem: many users retweet a problem tweet in order to condemn it, and in this system they’ll take a reputation hit for it. This is something we could argue about, but I feel it’s warranted, because they are also widening the tweet’s audience as they attempt to condemn it; instead, they should be reporting it into the system for action. This would require some education and retraining to help users change the habit (and it also requires users learning to trust that reporting problems will reliably lead to action).

We’ve now built some useful tools that can help us get a handle on what users are doing on the site and the content they are posting. In the second part of this article, we’ll look at ways to leverage this data to clean up the problems in a social network, even at the scale of a site like Twitter.