Spot the bot: the inverse problems of NLP

Authors: Vasilii A. Gromov, Quynh Nhu Dang, Alexandra S. Kogan, Assel Yerbolova
Language: English
Year of publication: 2024
Subject:
Source: PeerJ Computer Science, Vol 10, p e2550 (2024)
Document type: article
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.2550
Description: This article addresses the problem of distinguishing human-written texts from bot-generated ones. In contrast to the classical formulation, which focuses on a single type of bot, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the semantic space of a language. To construct the training and test datasets, we propose to separate not the texts of bots but the bots themselves, so that the test sample contains texts of bots (and people) that were absent from the training sample. We aim to find efficient and versatile features rather than a complex classification architecture that handles only one particular type of bot. In the study, we derive features for human-written and bot-generated texts using clustering (Wishart and K-Means, as well as their fuzzy variations) and nonlinear dynamics techniques (entropy-complexity measures). We then deliberately use the simplest classifiers (support vector machine, decision tree, random forest) together with the derived features to decide whether a text is human-written. The large-scale simulation shows good classification results (a classification quality of over 96%), although the results vary across languages of different language families.
Database: Directory of Open Access Journals
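
The abstract's dataset construction (separating the bots themselves rather than their texts) can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_by_source`, the `(source_id, text, label)` tuple layout, and the toy data are all hypothetical, and the sketch assumes each source carries exactly one label.

```python
import random
from collections import defaultdict


def split_by_source(samples, test_fraction=0.25, seed=0):
    """Split (source_id, text, label) samples so no source is in both sets.

    Hypothetical helper: sources are held out per label, so the test set
    contains both unseen bots and unseen people. Assumes every source has
    a single label and that each label has at least two sources.
    """
    sources_by_label = defaultdict(set)
    for source_id, _text, label in samples:
        sources_by_label[label].add(source_id)

    rng = random.Random(seed)
    test_sources = set()
    for label in sorted(sources_by_label):
        sources = sorted(sources_by_label[label])
        rng.shuffle(sources)
        # Hold out at least one source per label.
        n_test = max(1, round(len(sources) * test_fraction))
        test_sources.update(sources[:n_test])

    train = [s for s in samples if s[0] not in test_sources]
    test = [s for s in samples if s[0] in test_sources]
    return train, test


# Toy data: four bots and four people, two texts each (invented for illustration).
samples = []
for i in range(4):
    for j in range(2):
        samples.append((f"bot_{i}", f"bot text {i}-{j}", "bot"))
        samples.append((f"human_{i}", f"human text {i}-{j}", "human"))

train, test = split_by_source(samples)
train_sources = {s[0] for s in train}
test_sources = {s[0] for s in test}
print(sorted(test_sources))
```

Splitting on the source identifier rather than on individual texts means a classifier evaluated on `test` never sees any text from the held-out bots or people during training, which is the property the abstract asks for.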