Simple Baseline Machine Learning Text Classifiers for Small Datasets

Authors:	Achim Klein, Matthias Riekert, Martin Riekert
Publication year:	2021
Subject:	Basis (linear algebra); business.industry; Computer science; 02 engineering and technology; Machine learning; computer.software_genre; Term (time); Weighting; Support vector machine; Set (abstract data type); Annotation; 020204 information systems; Factor (programming language); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; business; Representation (mathematics); computer; computer.programming_language
Source:	SN Computer Science. 2
ISSN:	2661-8907 2662-995X
DOI:	10.1007/s42979-021-00480-4
Description:	Text classification is important to better understand online media. A major obstacle to building accurate machine-learning text classifiers is that training sets are small because annotating them is costly. On this basis, we investigated how SVM and NBSVM text classifiers should be designed to achieve high accuracy and how the training sets should be sized to use annotation labor efficiently. We used a four-way repeated-measures full-factorial design of 32 design factor combinations. For each design factor combination, 22 training set sizes were examined. These training sets were subsets of seven public text datasets. We studied the statistical variance of accuracy estimates by randomly drawing new training sets, resulting in accuracy estimates for 98,560 different experimental runs. Our major contribution is a set of empirically evaluated guidelines for creating online media text classifiers using small training sets. We recommend uni- and bi-gram features as text representation, btc term weighting and a linear-kernel NBSVM. Our results suggest that high classification accuracy can be achieved using a manually annotated dataset of only 300 examples.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::306aee27fe4b8be8561486dcbe102517 https://doi.org/10.1007/s42979-021-00480-4
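
The run count quoted in the description can be sanity-checked with a little arithmetic. The sketch below assumes that the 98,560 runs are a full crossing of design factor combinations, training set sizes, datasets, and repeated random draws; the record does not state this explicitly, and the function name `implied_draws_per_cell` is purely illustrative.

```python
def implied_draws_per_cell(total_runs, combinations, sizes, datasets):
    """Return (cells, draws, remainder) under the assumption that
    total_runs = combinations * sizes * datasets * draws."""
    # One "cell" is a single (combination, training set size, dataset) triple.
    cells = combinations * sizes * datasets
    draws, remainder = divmod(total_runs, cells)
    return cells, draws, remainder


# Figures taken from the description field of this record.
cells, draws, remainder = implied_draws_per_cell(
    total_runs=98_560, combinations=32, sizes=22, datasets=7
)
print(f"cells={cells}, draws per cell={draws}, remainder={remainder}")
```

A remainder of zero means the quoted figures are consistent with the assumed full crossing; it does not by itself confirm that the authors structured the experiment this way.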