Text Data Augmentation Techniques for Word Embeddings in Fake News Classification

Authors:	Jozef Kapusta, David Drzik, Kirsten Steflovic, Kitti Szabo Nagy
Language:	English
Year of publication:	2024
Subject:	Back translation; function word deletion; synonym replacement; text data augmentation; Word2Vec; word embeddings; Electrical engineering. Electronics. Nuclear engineering; TK1-9971
Source:	IEEE Access, Vol. 12, pp. 31538-31550 (2024)
Document type:	article
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2024.3369918
Description:	Contemporary language models rely heavily on large training corpora: the larger the corpus, the better a model can capture semantic relationships. In practice, however, the available corpora are often limited in scope. One potential solution is to expand the existing corpus with data augmentation techniques. In this article, we analyze three such techniques: Synonym Replacement, Back Translation (BT), and Reduction of Function Words (function word deletion, FWD). Using these three techniques, we prepared several versions of the corpus used to train Word2Vec Skip-gram models. The techniques were validated through extrinsic evaluation, in which the Word2Vec Skip-gram models generated word vectors for classifying fake news articles, and the performance measures of the resulting classifiers were analyzed. The study finds statistically significant differences in classifier outcomes between the augmented and original corpora. Specifically, Back Translation significantly improves accuracy, notably with the Support Vector and Bernoulli Naive Bayes models. FWD improves Logistic Regression, while the original corpus performs best in Random Forest classification. The article also includes an intrinsic evaluation based on lexical semantic relations between word pairs, which reveals nuanced differences in semantic relations across the augmented corpora. Notably, the BT corpus aligns better with established lexical resources, showing promising improvements in capturing specific semantic relationships.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/4c989e39e15146088571517ce29cf030
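
To make the function word deletion idea from the description more concrete, the following is a minimal Python sketch, not the authors' implementation. The function-word list, the deletion probability `p`, the helper names `delete_function_words` and `augment_corpus`, and the choice to append reduced sentences to the original corpus are all assumptions made for illustration; this record does not specify how the paper builds its augmented corpora.

```python
import random

# Hypothetical, deliberately small function-word list (an assumption);
# the list used in the paper is not given in this record.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to",
    "and", "or", "but", "is", "are", "was", "were",
}


def delete_function_words(tokens, p=1.0, rng=None):
    """Return a copy of `tokens` where each function word is dropped
    with probability `p`; content words are always kept."""
    if rng is None:
        rng = random.Random(0)
    return [
        tok for tok in tokens
        if tok.lower() not in FUNCTION_WORDS or rng.random() >= p
    ]


def augment_corpus(sentences, p=1.0, seed=0):
    """Return the original sentences followed by their reduced variants.

    Appending (rather than replacing) is an assumption of this sketch.
    Sentences that are unchanged or become empty are not duplicated.
    """
    rng = random.Random(seed)
    augmented = list(sentences)
    for sentence in sentences:
        reduced = delete_function_words(sentence, p, rng)
        if reduced and reduced != sentence:
            augmented.append(reduced)
    return augmented


corpus = [
    ["The", "senator", "denied", "the", "report"],
    ["Markets", "fell", "sharply"],
]
augmented = augment_corpus(corpus)
for sentence in augmented:
    print(" ".join(sentence))
```

A tokenized corpus produced this way could then be passed to a Word2Vec Skip-gram trainer; that step is omitted here to keep the sketch free of external dependencies.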