Data Collection for Natural Language Processing Systems

Authors: Emil Krsak, Michal Ďuračík, Stefan Toth, Miroslava Mikusova, Patrik Hrkut, Matej Mesko
Year of publication: 2020
Subject:
Source: Communications in Computer and Information Science ISBN: 9789811533792
ACIIDS (Companion)
DOI: 10.1007/978-981-15-3380-8_6
Description: Any NLP system needs enough data for training and testing purposes. This data can be split into two datasets: correct and incorrect (erroneous) data. Finding and obtaining a set of correct data is usually not a problem, because correct texts are available from many sources, although they may also contain some mistakes. On the other hand, it is hard to obtain data containing errors such as typos, mistakes and misspellings. This kind of data is usually obtained through a lengthy manual process and requires annotation by humans. One way to get the incorrect dataset faster is to generate it. However, this raises the problem of how to generate incorrect texts so that they correspond to real human mistakes. In this paper, we focused on obtaining the incorrect dataset with the help of humans. We created an automated web application (a game) that collects incorrect texts and misspellings from players for texts written in the Slovak language. Based on the obtained data, we built a model of common errors that can be used to generate a large amount of authentic-looking erroneous texts.
Database: OpenAIRE