Simple Baseline Machine Learning Text Classifiers for Small Datasets

Authors: Achim Klein, Matthias Riekert, Martin Riekert
Publication year: 2021
Subject:
Source: SN Computer Science. 2
ISSN: 2661-8907
2662-995X
DOI: 10.1007/s42979-021-00480-4
Description: Text classification is important to better understand online media. A major obstacle to building accurate machine learning text classifiers is that training sets are small because annotating them is costly. On this basis, we investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy and how training sets should be sized to use annotation labor efficiently. We used a four-way repeated-measures full-factorial design with 32 design factor combinations. For each design factor combination, 22 training set sizes were examined. These training sets were subsets of seven public text datasets. We studied the statistical variance of the accuracy estimates by randomly drawing new training sets, which resulted in accuracy estimates for 98,560 different experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers from small training sets. We recommend uni- and bi-gram features as the text representation, btc term weighting, and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved with a manually annotated dataset of only 300 examples.
Database: OpenAIRE
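
Illustration: The recommended feature pipeline (uni- and bi-grams, btc weighting, naive Bayes scaling for an NBSVM) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "btc" follows SMART notation (binary term frequency, idf, cosine normalization) and that NBSVM scales features by a smoothed naive Bayes log-count ratio before a linear SVM is trained. The function names, the whitespace tokenizer, the smoothing value alpha=1.0, and the toy documents are all hypothetical. The final step, fitting a linear-kernel SVM on the scaled vectors, is omitted to keep the example free of dependencies.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return the uni- and bi-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def btc_vectors(docs):
    """Weight each document as binary tf (b) times idf (t), cosine-normalized (c).

    Assumption: this is the SMART-notation reading of 'btc'.
    """
    doc_terms = [set(ngrams(d)) for d in docs]  # a set makes tf binary
    df = Counter(t for terms in doc_terms for t in terms)
    n_docs = len(docs)
    vectors = []
    for terms in doc_terms:
        weights = {t: math.log(n_docs / df[t]) for t in terms}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm > 0 else 0.0)
                        for t, w in weights.items()})
    return vectors


def log_count_ratio(vectors, labels, alpha=1.0):
    """Smoothed naive Bayes log-count ratio per feature (label 1 vs. label 0)."""
    vocab = sorted({t for v in vectors for t in v})
    p = {t: alpha for t in vocab}  # smoothed feature mass in class 1
    q = {t: alpha for t in vocab}  # smoothed feature mass in class 0
    for v, y in zip(vectors, labels):
        target = p if y == 1 else q
        for t, w in v.items():
            target[t] += w
    p_sum, q_sum = sum(p.values()), sum(q.values())
    return {t: math.log((p[t] / p_sum) / (q[t] / q_sum)) for t in vocab}


def nb_scale(vector, ratio):
    """Scale a document vector elementwise by the log-count ratio."""
    return {t: w * ratio.get(t, 0.0) for t, w in vector.items()}


# Hypothetical toy data: 1 = positive, 0 = negative.
docs = ["great movie great cast", "terrible movie",
        "great plot", "terrible cast terrible plot"]
labels = [1, 0, 1, 0]

vectors = btc_vectors(docs)
ratio = log_count_ratio(vectors, labels)
scaled = [nb_scale(v, ratio) for v in vectors]  # input for a linear SVM
```

In this sketch, features that occur mostly in positive documents receive a positive ratio and features that occur mostly in negative documents receive a negative one, so the scaled vectors already carry class evidence before any SVM is fitted.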