HateVersarial: Adversarial attack against hate speech detection algorithms on Twitter

Edita Grolman, Hodaya Binyamini, Asaf Shabtai, Yuval Elovici, Ikuya Morikawa, Toshiya Shimizu

Proceedings of the 30th ACM Conference on User Modeling, Adaptation and …, 2022

Machine learning (ML) models are commonly used to detect hate speech, which is considered one of the main challenges of online social networks. However, ML models have been shown to be vulnerable to well-crafted input samples referred to as adversarial examples. In this paper, we present an adversarial attack against hate speech detection models and explore the attack’s ability to: (1) prevent the detection of a hateful user, which should result in the termination of the user’s account, and (2) classify normal users as hateful, which may lead to the termination of a legitimate user’s account. The attack targets ML models that are trained on tabular, heterogeneous datasets (such as the datasets used for hate speech detection) and attempts to determine the minimal number of the most influential mutable features that should be altered to create a successful adversarial example. To demonstrate and …
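The core idea the abstract describes — greedily altering the fewest, most influential *mutable* features of a tabular sample until the classifier's label flips — can be sketched as follows. This is not the authors' implementation; it is a minimal illustration on a synthetic tabular dataset, using a logistic-regression victim model whose coefficient magnitudes serve as the influence proxy, with an assumed per-feature perturbation cap `cap` and decision margin `margin`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a tabular user-behavior dataset (label 1 = "hateful").
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def minimal_feature_attack(x, mutable, cap=4.0, margin=0.1):
    """Greedily alter the fewest, most influential mutable features
    (each change capped at `cap`) until the predicted label flips.
    Returns the adversarial sample and the indices that were changed."""
    x_adv = x.astype(float).copy()
    orig = clf.predict(x.reshape(1, -1))[0]
    target_logit = -margin if orig == 1 else margin
    # Influence proxy: coefficient magnitude of the linear victim model.
    order = sorted(mutable, key=lambda i: -abs(clf.coef_[0, i]))
    changed = []
    for i in order:
        logit = clf.decision_function(x_adv.reshape(1, -1))[0]
        # Perturb feature i just enough to push the logit past the boundary.
        delta = np.clip((target_logit - logit) / clf.coef_[0, i], -cap, cap)
        x_adv[i] += delta
        changed.append(i)
        if clf.predict(x_adv.reshape(1, -1))[0] != orig:
            break
    return x_adv, changed

# Evasion scenario: a user flagged as hateful, closest to the decision boundary.
scores = clf.decision_function(X)
idx = int(np.argmin(np.where(scores > 0, scores, np.inf)))
x_adv, changed = minimal_feature_attack(X[idx], mutable=list(range(10)))
```

A real attack would additionally restrict `mutable` to features the attacker can actually control (e.g. posting behavior, not account age) and would need a transferable influence estimate when the victim model is a black box.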