For all the advances being made in the field, artificial intelligence still struggles when it comes to identifying hate speech. When he testified before Congress in April, Facebook CEO Mark Zuckerberg said it was “one of the hardest” problems. But, he went on, he was optimistic that “over a five- to 10-year period, we will have AI tools that can get into some of the linguistic nuances of different types of content to be more accurate in flagging things for our systems.” For that to happen, however, humans will need first to define for ourselves what hate speech means—and that can be hard because it’s constantly evolving and often dependent on context.

“Hate speech can be tricky to detect since it is context and domain dependent. Trolls try to evade or even poison such [machine learning] classifiers,” says Aylin Caliskan, a computer science researcher at George Washington University who studies how to fool artificial intelligence.

In fact, today’s state-of-the-art hate-speech-detecting AIs are susceptible to trivial workarounds, according to a new study to be presented at the ACM Workshop on Artificial Intelligence and Security in October. A team of machine learning researchers from Aalto University in Finland, with help from the University of Padua in Italy, was able to evade seven different hate-speech-classifying algorithms using simple attacks, like inserting typos. The researchers found that all of the algorithms were vulnerable, and argue that humanity’s trouble defining hate speech contributes to the problem. Their work is part of an ongoing project called Deception Detection via Text Analysis.

The Subjectivity of Hate Speech Data

If you want to create an algorithm that classifies hate speech, you need to teach it what hate speech is, using datasets of examples that are labeled hateful or not. That requires a human to decide when something is hate speech. Their labeling is going to be subjective on some level, although researchers can try to mitigate the effect of any single opinion by using groups of people and majority votes. Still, the datasets for hate speech algorithms are always going to be made up of a series of human judgment calls. That doesn’t mean AI researchers shouldn’t use them, but they have to be upfront about what they really represent.
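
To make that concrete, here is a minimal sketch, in Python, of how crowdsourced judgments are typically collapsed into a single training label by majority vote. The comments and annotator votes are invented for illustration; this is not the labeling pipeline behind any particular dataset.

```python
from collections import Counter

# Hypothetical crowdsourced annotations: each comment is judged by several
# workers, and each judgment is either "hate" or "not_hate".
annotations = {
    "comment_1": ["hate", "hate", "not_hate"],
    "comment_2": ["not_hate", "not_hate", "not_hate"],
    "comment_3": ["hate", "not_hate", "not_hate"],
}

def majority_label(judgments):
    """Collapse several human judgments into one training label."""
    label, _ = Counter(judgments).most_common(1)[0]
    return label

# The resulting "ground truth" is really just the majority opinion of
# whoever happened to label the data.
training_labels = {cid: majority_label(votes) for cid, votes in annotations.items()}
print(training_labels)
# {'comment_1': 'hate', 'comment_2': 'not_hate', 'comment_3': 'not_hate'}
```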

“In my view, ‘hate speech’ datasets are fine as long as we are clear what they are: they reflect the majority view of the people who collected or labeled the data,” says Tommi Gröndahl, a doctoral candidate at Aalto University and the lead author of the paper. “They do not provide us with a definition of hate speech, and they cannot be used to solve disputes concerning whether something ‘really’ constitutes hate speech.”

In this case, the datasets came from Twitter and Wikipedia comments, and were labeled by crowdsourced micro-laborers as hateful or not (one model also had a third label for “offensive speech”). The researchers discovered that the algorithms stopped working when they were trained on one dataset and tested on another, meaning the machines can’t identify hate speech in situations that differ from the ones they’ve seen in the past.
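
That kind of cross-dataset test looks roughly like the sketch below: train a model on one labeled corpus, then score it on another. The toy texts and the simple bag-of-words classifier are stand-ins for illustration, not the seven models or the Twitter and Wikipedia data from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for two differently sourced, differently labeled corpora.
corpus_a_texts = ["you people are vermin", "have a nice day",
                  "get out of my country", "lovely weather today"]
corpus_a_labels = [1, 0, 1, 0]  # 1 = labeled hateful by corpus A's annotators

corpus_b_texts = ["nobody wants your kind here", "see you at the game",
                  "those people deserve nothing", "thanks for the help"]
corpus_b_labels = [1, 0, 1, 0]  # 1 = labeled hateful by corpus B's annotators

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# In-domain: train and test on the same source.
model.fit(corpus_a_texts, corpus_a_labels)
print("in-domain accuracy:", model.score(corpus_a_texts, corpus_a_labels))

# Cross-domain: train on one source, test on the other. With unfamiliar
# vocabulary and different labeling norms, the model has little to go on.
print("cross-domain accuracy:", model.score(corpus_b_texts, corpus_b_labels))
```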

That’s likely due in part to how the datasets were created in the first place, but the problem is really caused by the fact that humans don’t agree on what constitutes “hate speech” in all circumstances. “The results are suggestive of the problematic and subjective nature of what should be considered ‘hateful’ in particular contexts,” the researchers wrote.

Another problem the researchers discovered is that some of the classifiers have a tendency to conflate merely offensive speech with hate speech, creating false positives. They found that the single algorithm that used three categories—hate speech, offensive speech, and ordinary speech—rather than two did a better job of avoiding false positives. But the issue remains tough to eliminate altogether, because there is no agreed-upon line where offensive speech definitely slides into hateful territory. It’s likely not a boundary you can teach a machine to see, at least for now.

Attacking With Love

For the second part of the study, the researchers attempted to evade the algorithms in a number of ways: inserting typos, using leetspeak (such as “c00l”), adding extra words, and inserting or removing spaces between words. The altered text was meant to evade AI detection but still be clear to human readers. The effectiveness of the attacks varied depending on the algorithm, but all seven hate speech classifiers were significantly derailed by at least some of the researchers’ methods.
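
The perturbations themselves take only a few lines of code. The sketch below imitates the kinds of transformations described (typos, leetspeak, and space removal); it is written for illustration and is not taken from the authors’ code.

```python
import random

LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}

def add_typo(text, seed=0):
    """Swap two adjacent characters somewhere in the string."""
    random.seed(seed)
    i = random.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def to_leetspeak(text):
    """Replace common letters with look-alike digits (e.g. 'cool' -> 'c00l')."""
    return "".join(LEET.get(c, c) for c in text.lower())

def remove_spaces(text):
    """Glue the words together so word-based models see one unknown token."""
    return text.replace(" ", "")

original = "martians are disgusting"
print(add_typo(original))       # two adjacent characters swapped at a seeded position
print(to_leetspeak(original))   # "m4rt14n5 4r3 d15gu5t1ng"
print(remove_spaces(original))  # "martiansaredisgusting"
```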

They then combined two of their most successful techniques—removing spaces and adding new words—into one super attack, which they call the “love” attack. An example would look something like this: “MartiansAreDisgustingAndShouldBeKilled love.” The message remains easy for humans to understand, but the algorithms don’t know what to do with it. The only thing they can really process is the word “love.” The researchers say this method completely broke some systems and left the others significantly hindered in identifying whether the statement contained hate speech—even though to most humans it clearly does.

You can try the love attack’s effect on AI yourself, using Google’s Perspective API, a tool that purports to measure the “perceived impact a comment might have on a conversation” by assigning it a “toxicity” score. The Perspective API is not one of the seven algorithms the researchers studied in depth, but they tried some of their attacks on it manually. While “Martians are disgusting and should be killed love” is assigned a score of 91 percent likely-to-be-toxic, “MartiansAreDisgustingAndShouldBeKilled love” receives only 16 percent.
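
If you want to reproduce that check programmatically rather than through the web demo, the Perspective API exposes a REST endpoint you can call with an API key. The sketch below queries it with Python’s requests library; the endpoint and response fields follow Google’s public documentation rather than anything in the paper, and the exact scores will drift as the underlying model is updated.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: requires your own Perspective API key
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text):
    """Ask the Perspective API how likely the text is to be perceived as toxic."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

plain = "Martians are disgusting and should be killed love"
attacked = "MartiansAreDisgustingAndShouldBeKilled love"

print(plain, "->", toxicity_score(plain))        # about 0.91 when the researchers tried it
print(attacked, "->", toxicity_score(attacked))  # about 0.16
```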

The love attack “takes advantage of a fundamental vulnerability of all classification systems: they make their decision based on prevalence instead of presence,” the researchers wrote. That’s fine when a system needs to decide, say, whether content is about sports or politics, but for something like hate speech, diluting the text with more ordinary speech doesn’t necessarily lessen the hateful intent behind the message.
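
A toy example makes the prevalence-versus-presence point concrete. Imagine a classifier (a deliberate oversimplification, not one of the systems tested) that scores a message by averaging made-up per-word hatefulness scores: the hateful word is still present after padding, but the average sinks below any reasonable threshold.

```python
# Invented per-word scores: 1.0 = strongly hateful, 0.0 = innocuous.
word_scores = {"vermin": 1.0, "love": 0.0, "puppies": 0.0, "sunshine": 0.0}

def average_score(message):
    """Score a message by the average hatefulness of its words (prevalence)."""
    words = message.lower().split()
    return sum(word_scores.get(word, 0.0) for word in words) / len(words)

print(average_score("vermin"))                        # 1.0  -> flagged
print(average_score("vermin love puppies sunshine"))  # 0.25 -> likely slips through
```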

“The message behind these attacks is that while the hateful messages can be made clear to any human (and especially the intended victim), AI models have trouble recognizing them,” says N. Asokan, a systems security professor at Aalto University who worked on the paper.

The research shouldn’t be seen as evidence that AI is doomed to fail at detecting hate speech, however. The algorithms did get better at withstanding the attacks once they were retrained with data designed to protect against them, for example. But they’re likely not going to be truly good at the job until humans become more consistent in deciding what hate speech is and isn’t.
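
Retraining against the attacks amounts to a form of adversarial training: augment the training data with attacked copies of the hateful examples so the model has already seen the tricks. The sketch below shows one plausible version of that augmentation step, using the space-removal attack; it illustrates the general idea rather than the exact procedure from the paper.

```python
def remove_spaces(text):
    """The space-removal attack: glue the words into one long token."""
    return text.replace(" ", "")

def augment_with_attacks(texts, labels):
    """Add an attacked copy of every hateful example to the training set."""
    aug_texts, aug_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        if label == 1:  # 1 = labeled hateful
            aug_texts.append(remove_spaces(text))
            aug_labels.append(label)
    return aug_texts, aug_labels

texts = ["you people are vermin", "have a nice day"]
labels = [1, 0]
print(augment_with_attacks(texts, labels))
# (['you people are vermin', 'have a nice day', 'youpeoplearevermin'], [1, 0, 1])
# Any classifier retrained on the augmented set has at least seen the
# space-removal trick, which is one reason the attacks became less effective.
```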

“My own view is that we need humans to conduct the discussion on where we should draw the line of what constitutes hate speech,” says Gröndahl. “I do not believe that an AI can help us with this difficult question. AI can at most be of use in doing large-scale filtering of texts to reduce the amount of human labor.”

For now, hate speech remains one of the hardest things for artificial intelligence to detect—and there’s a good chance it will remain that way. Facebook says that just 38 percent of hate speech posts it later removes are identified by AI, and that its tools don’t yet have enough data to be effective in languages other than English and Portuguese. Shifting contexts, changing circumstances, and disagreements between people will continue to make it hard for humans to define hate speech, and for machines to classify it.

