The Algorithm
To identify toxicity and hate speech, we use two custom-developed algorithms (classifiers). The development of these specialized classifiers consisted of the following stages:
1. Annotation
First, approximately 14,000 online posts (comments and other user-generated content) were selected from the data provided to us by our media partners. These posts were then manually classified by specially trained research assistants, following our expert-developed codebook, as hate speech, toxic, or neither. For posts containing hate speech, the target of the hateful content (for example religion, gender, or nationality) was also recorded.
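For illustration, a minimal sketch of how a single annotated post could be represented is shown below. The field names and categories are hypothetical and do not reflect the project's actual annotation schema or codebook.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one annotated post; the field names are
# illustrative only and not the project's actual schema.
@dataclass
class AnnotatedPost:
    post_id: str
    text: str
    language: str                 # "de" or "fr"
    toxic: bool                   # toxicity label from the codebook
    hate_speech: bool             # hate-speech label from the codebook
    hate_targets: list = field(default_factory=list)  # e.g. ["religion", "gender"]

example = AnnotatedPost(
    post_id="0001",
    text="Beispielkommentar ...",
    language="de",
    toxic=True,
    hate_speech=True,
    hate_targets=["nationality"],
)
```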
2. Training
Our two classifiers for detecting toxicity and hate speech are based on google-bert/bert-base-multilingual-cased, a widely used multilingual language model pretrained on Wikipedia text in over 100 languages. They were specifically adapted to German- and French-language comments from Switzerland.
Two different data sources were used for this adaptation:
- a newly collected 2024 comment dataset consisting of more than 14,000 carefully annotated posts from Swiss news platforms (about 25% of which were toxic), and
- a large comment dataset from an earlier hate-speech project (Kotarcic et al., 2022) containing over 420,000 posts.
The models were fine-tuned in two stages: first on the older, larger dataset, and then on the newly collected, specifically annotated Swiss data; this two-stage approach produced the strongest performance. In addition, extensive hyperparameter searches were conducted, varying the learning rate, batch size, number of epochs, and class-weighting factors. The best-performing model for each task was selected for deployment.
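As a rough illustration of this two-stage setup, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The file names, column names, and hyperparameter values are placeholders rather than the project's actual configuration, and class weighting and the full hyperparameter search are omitted for brevity.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_NAME = "google-bert/bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Placeholder files; both are assumed to contain "text" and "label" columns.
stage1_data = load_dataset("csv", data_files="hate_speech_2022.csv")["train"].map(tokenize, batched=True)
stage2_data = load_dataset("csv", data_files="swiss_comments_2024.csv")["train"].map(tokenize, batched=True)

def finetune(model, dataset, output_dir, learning_rate, epochs):
    """One fine-tuning stage; learning rate, batch size, and epochs are
    examples of the hyperparameters that were searched over."""
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=learning_rate,
        per_device_train_batch_size=32,
        num_train_epochs=epochs,
        weight_decay=0.01,
    )
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=data_collator).train()
    return model

# Stage 1: adapt to the large, older hate-speech corpus.
model = finetune(model, stage1_data, "stage1", learning_rate=2e-5, epochs=2)
# Stage 2: continue fine-tuning on the newly annotated Swiss comments.
model = finetune(model, stage2_data, "stage2", learning_rate=1e-5, epochs=4)
```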
3. Validation
Using a subset of our manual annotations that was not included in training, we evaluated how accurately the various candidate models detected toxicity and hate speech. This test set was drawn using statistically rigorous methods designed to extract as much information as possible from the sample (Tomas-Valiente, 2025).
On this basis, the best-performing models for toxicity and hate speech were selected. Specifically, we assessed how much more informative our classification is than random classification (AUC), what proportion of truly toxic or hateful posts was detected (sensitivity, also called hit rate or recall), and what proportion of posts labeled as toxic or hateful actually contained such content (precision).
On a representative, randomly assembled test dataset, the models achieved solid performance (toxicity: AUC = 0.83, F1 = 0.55; hate speech: AUC = 0.95, F1 = 0.42). The models show particularly strong sensitivity (toxicity = 0.79; hate speech = 0.90), meaning they detect a large share of all truly toxic or hateful posts. Precision is moderate (toxicity = 0.42; hate speech = 0.28) but within the expected range for toxicity classification. Overall, the models achieve good discriminative power in distinguishing toxic from non-toxic content and outperform the best available alternatives on our test data, including Google's Perspective API.
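For reference, these metrics can be computed with scikit-learn along the following lines. The labels, predicted probabilities, and the 0.5 decision threshold below are made-up placeholders, not our actual test data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Made-up placeholder data: gold labels from the held-out annotations and
# the classifier's predicted probabilities for the same posts.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_prob = [0.10, 0.85, 0.40, 0.30, 0.95, 0.20, 0.55, 0.70]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]       # assumed decision threshold

print("AUC:        ", roc_auc_score(y_true, y_prob))   # discrimination vs. random guessing
print("F1:         ", f1_score(y_true, y_pred))
print("Sensitivity:", recall_score(y_true, y_pred))    # share of truly toxic posts detected
print("Precision:  ", precision_score(y_true, y_pred)) # share of flagged posts that are toxic
```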
4. Debiasing
Both toxicity and hate speech evolve continuously—simply because language itself changes, with new terms emerging and old terms being redefined (e.g., “Schwurbler”). To prevent our classifiers from degrading over time, we apply a weekly adjustment method called debiasing (Egami et al., 2023).
Each week, a sample of 1,000 posts per media outlet is sent to an LLM to obtain up-to-date assessments of toxicity and hate speech. Based on these current annotations, the weekly prevalence estimated by the classifiers is adjusted. Finally, we quantify the uncertainty of this adjustment to account for the potential error rate of the LLM-based annotations.
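A deliberately simplified sketch of such a prevalence correction is shown below. It illustrates only the basic idea of correcting the classifier's weekly prevalence with a small LLM-annotated sample; the actual estimator in Egami et al. (2023) is more general and additionally models the annotation error, and all numbers here are made up.

```python
import math

def debiased_prevalence(clf_labels_week, clf_labels_sample, llm_labels_sample):
    """Correct the classifier's weekly prevalence with a random LLM-annotated sample.

    clf_labels_week:   0/1 classifier labels for all posts of the week
    clf_labels_sample: classifier labels for the randomly sampled posts
    llm_labels_sample: LLM labels for the same sampled posts
    """
    n_week, n_sample = len(clf_labels_week), len(llm_labels_sample)

    raw_prevalence = sum(clf_labels_week) / n_week
    # The average disagreement between LLM and classifier on the sample
    # estimates the classifier's bias for this week.
    residuals = [llm - clf for llm, clf in zip(llm_labels_sample, clf_labels_sample)]
    bias = sum(residuals) / n_sample

    corrected = raw_prevalence + bias
    # Sampling uncertainty of the correction (the LLM's own error rate is
    # handled separately in the production pipeline).
    variance = sum((r - bias) ** 2 for r in residuals) / (n_sample * (n_sample - 1))
    return corrected, math.sqrt(variance)

# Toy example: ten posts in one week, three of which were sampled for the LLM.
week = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
sample_clf = [1, 0, 1]
sample_llm = [1, 1, 1]
print(debiased_prevalence(week, sample_clf, sample_llm))
```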
References
Egami, N., Hinck, M., Stewart, B., & Wei, H. 2023. Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models. Advances in Neural Information Processing Systems, 36, 68589-68601.
Kotarcic, A., Hangartner, D., Gilardi, F., Kurer, S., & Donnay, K. 2022. Human-in-the-loop hate speech classification in a multilingual context. Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Tomas-Valiente, F. 2025. Uncertain performance: How to quantify uncertainty and draw test sets when evaluating classifiers. Working paper.