Categorizing Online Harassment on Twitter

Authors: Norbert Zeh, Samuel Bruno da Silva Sousa, Mozhgan Saeidi, Evangelos E. Milios, Lilian Berton
Publication year: 2020
Subject:
Source: Machine Learning and Knowledge Discovery in Databases ISBN: 9783030438869
PKDD/ECML Workshops (2)
DOI: 10.1007/978-3-030-43887-6_22
Description: Harassment on social media is a hard problem to tackle because these platforms are virtual spaces in which people are free to express themselves without restriction. Furthermore, the large number of users posting on platforms such as Twitter makes sexist and sexually harassing content harder to control, which calls for robust Machine Learning (ML) methods. This work therefore compares the performance of supervised ML algorithms for categorizing online harassment in Twitter posts. We tested Logistic Regression, Gaussian Naive Bayes, Decision Trees, Random Forest, Linear SVM, Gaussian SVM, Polynomial SVM, Multi-Layer Perceptron, and AdaBoost on the SIMAH Competition benchmark data, using TF-IDF vectors and Word2Vec embeddings as features. We reached accuracy scores above 0.80 for all the harassment types in the data. We also showed that, when using TF-IDF vectors, Linear and Gaussian SVM are the best methods for predicting harassment content, while Decision Trees and Random Forest better categorize physical and sexual harassment. Overall, TF-IDF vectors yielded higher performance on these data, suggesting that the training corpus for Word2Vec negatively influenced the classification outcomes.
Database: OpenAIRE
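
Note: The description names TF-IDF vectors as one of the two feature types but does not say which TF-IDF variant was used. The following is a minimal sketch of how such vectors might be computed, assuming a smoothed IDF and L2 normalization. The function name `tfidf_vectors` and the toy posts are hypothetical and are not taken from the paper or the SIMAH data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times IDF, then scaled to unit Euclidean length.
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        vectors.append([w / norm if norm else 0.0 for w in vec])
    return vocab, vectors


# Invented toy posts for illustration only.
posts = [
    "you are awful".split(),
    "have a nice day".split(),
    "you are nice".split(),
]
vocab, vectors = tfidf_vectors(posts)
```

In a setup like the one described, each row of `vectors` would serve as the feature vector passed to a classifier such as a linear SVM. Terms that appear in fewer posts (here "awful") receive a larger weight than terms shared across posts (here "you").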