To build our toxicity and hate-speech classifiers, specially trained research staff first had to manually classify thousands of posts. For each post, annotators indicated whether it contained hate speech, toxicity, or neither; these annotations were then used to train our classifiers (see details about our algorithm). To ensure that all annotators shared the same understanding of “hate speech” and “toxicity,” ETH experts developed a dedicated codebook that breaks down the concepts in detail and provides concrete examples. The codebook includes the following instructions:
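
As a concrete illustration of what this annotation step produces, here is a minimal sketch of a single labeled training example. The `AnnotatedPost` structure and its field names are our own assumptions for illustration, not the project's actual data schema:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedPost:
    """One manually labeled post. Field names are illustrative
    assumptions, not the project's actual data format."""
    text: str          # the raw post shown to the annotator
    toxicity: bool     # annotator judged the post to be toxic
    hate_speech: bool  # annotator judged the post to contain hate speech

# A post labeled "neither" simply carries False for both concepts.
example = AnnotatedPost(
    text="Even a five-year-old would understand that.",
    toxicity=True,      # degrading/demeaning language (see codebook below)
    hate_speech=False,  # no identity group is targeted
)
```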

General Guidelines for Annotation

  • Always base your annotations on the instructions in the codebook.
  • Do not annotate for more than 1 hour at a time. Inform someone on the team if you feel overwhelmed.
  • Read every comment twice before making an annotation.
  • Do not over-interpret the text. If, after reading it twice, the presence of a concept is still unclear, code the concept as NOT PRESENT.
  • Annotate comments strictly according to the definitions in the codebook; do not use other definitions of these concepts or rely on your own “gut feeling.”

Toxicity

An umbrella term for a variety of forms of malicious and offensive communication. The codebook distinguishes the following subcategories (summarized schematically after the list):

  • Threats: Statements indicating that the target will be harmed, or calls on others to harm them.
  • Insults: Derogatory terms, including milder insults such as “idiot” or “stupid.”
  • Defamation: “An attack on someone’s reputation or integrity” (Oxford Dictionary), e.g., labeling the recipient a liar, corrupt, or a traitor.
  • Vulgarity: Use of swear words, e.g., “shit.”
  • Degrading or demeaning language: Language that attributes negative qualities to recipients (e.g., “even a five-year-old would understand that”) to shame them or diminish their reputation in the eyes of others.
  • Malice: Wishing harm upon the targets, e.g., “You should kill yourself.”
  • Exclusion: Telling others to shut up or that they are not welcome to express their opinion in a debate.
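
The seven subcategories can be read as a small controlled vocabulary that rolls up into a single toxicity label. The encoding below is our own illustrative sketch, not part of the published codebook:

```python
from enum import Enum

class ToxicitySubcategory(Enum):
    """The codebook's toxicity subcategories (member names are our shorthand)."""
    THREAT = "threats"
    INSULT = "insults"
    DEFAMATION = "defamation"
    VULGARITY = "vulgarity"
    DEGRADING = "degrading or demeaning language"
    MALICE = "malice"
    EXCLUSION = "exclusion"

def is_toxic(subcategories: set[ToxicitySubcategory]) -> bool:
    # Toxicity is an umbrella concept: one matching subcategory suffices.
    return len(subcategories) > 0
```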

Hate Speech

Hate speech is (i) toxic speech (see the previous section) that (ii) targets an individual or group in society based on their identity characteristics.
Hate speech is not necessarily a more severe form of toxicity; what distinguishes it is the target, not the intensity. A schematic version of this two-part definition follows the list of target groups below.

Target groups include:

  • Nationality
  • Ethnicity / skin color
  • Migration status
  • Religion
  • Gender
  • Sexual orientation
  • (Severe) disability
  • Age
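
To make the two-part definition concrete, here is a schematic version of the check: a post counts as hate speech exactly when it is toxic and it targets at least one of the identity characteristics listed above. The function and variable names are illustrative assumptions:

```python
# Identity characteristics the codebook lists as hate-speech targets.
TARGET_GROUPS = {
    "nationality",
    "ethnicity / skin color",
    "migration status",
    "religion",
    "gender",
    "sexual orientation",
    "(severe) disability",
    "age",
}

def is_hate_speech(is_toxic: bool, targeted: set[str]) -> bool:
    """Hate speech = toxic speech that targets an identity characteristic.

    Note that this is a conjunction of two conditions, not a severity
    threshold: hate speech need not be more toxic than other toxic speech.
    """
    return is_toxic and bool(targeted & TARGET_GROUPS)
```

For example, `is_hate_speech(True, {"religion"})` returns `True`, while a toxic post with no identity target (`is_hate_speech(True, set())`) is toxic but not hate speech.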

Implicit Targeting

References to groups may be implicit, which can make it harder to assess whether they qualify as hate speech. We consider the following statements to be hate speech:

  • Comments using stereotypical slurs for identity groups (e.g., “East Coast” as a stand-in for Jewish people).
  • Comments making clearly derogatory statements about an entire identity group, even if the wording refers only to a subgroup (e.g., “Islamists” used to describe all Muslims).
  • Comments using an identity group as an insult, even when the insult is not directed at members of that group (“that’s gay,” “you mongo”).