Categorizing Online Harassment on Twitter
Author: | Norbert Zeh, Samuel Bruno da Silva Sousa, Mozhgan Saeidi, Evangelos E. Milios, Lilian Berton |
---|---|
Year of publication: | 2020 |
Subject: |
Computer science; Criminology; Social sciences; Decision tree; Engineering and technology; Perceptron; Machine learning; Random forest; Support vector machine; Naive Bayes classifier; Pattern recognition; Electrical engineering, electronic engineering, information engineering; Harassment; Artificial intelligence & image processing; Word2vec; AdaBoost; Artificial intelligence; Other social sciences |
Source: | Machine Learning and Knowledge Discovery in Databases ISBN: 9783030438869 PKDD/ECML Workshops (2) |
DOI: | 10.1007/978-3-030-43887-6_22 |
Description: | Harassment on social media is a hard problem to tackle, since these platforms are virtual spaces in which people enjoy the liberty to express themselves with no restrictions. Furthermore, the large number of users posting on online media such as Twitter makes sexism and sexual harassment content harder to control, calling for robust Machine Learning (ML) methods to be applied to this task. This work therefore compares the performance of supervised ML algorithms at categorizing online harassment in Twitter posts. We tested Logistic Regression, Gaussian Naive Bayes, Decision Trees, Random Forest, Linear SVM, Gaussian SVM, Polynomial SVM, Multi-Layer Perceptron, and AdaBoost on the SIMAH Competition benchmark data, using TF-IDF vectors and Word2Vec embeddings as features. We reached accuracy scores above 0.80 for all the harassment types in the data. We also showed that, when using TF-IDF vectors, Linear and Gaussian SVM are the best methods for predicting harassment content, while Decision Trees and Random Forest better categorize physical and sexual harassment. Overall, TF-IDF vectors yielded higher performance on these data, suggesting that the training corpus for Word2Vec negatively affected the classification outcomes. |
Database: | OpenAIRE |
External link: |
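
The description mentions TF-IDF vectors as one of the two feature types fed to the classifiers. As a rough illustration only, the sketch below computes TF-IDF vectors in plain Python. The function name `tfidf_vectors`, the whitespace tokenization, the smoothed IDF formula, the L2 normalization, and the three placeholder texts are all assumptions made for this example; the record does not say which TF-IDF variant or preprocessing the authors used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of raw text documents.

    Hypothetical helper: lowercases and splits on whitespace, which is
    far simpler than real tweet preprocessing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Raw term count times IDF for every vocabulary term.
        vec = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so that document length does not dominate.
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm if norm else 0.0 for v in vec])
    return vocab, vectors


# Placeholder texts, not taken from the SIMAH data.
docs = ["you are great", "you are awful", "great weather today"]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` could then be passed, together with a label, to any of the supervised classifiers listed in the description. Terms that occur in fewer documents (here `awful`) receive a larger weight than terms shared across documents (here `you`).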