Data Collection for Natural Language Processing Systems

Authors:	Emil Krsak, Michal Ďuračík, Stefan Toth, Miroslava Mikusova, Patrik Hrkut, Matej Mesko
Year of publication:	2020
Subject:	050101 languages & linguistics; Data collection; Computer science; business.industry; Process (engineering); 05 social sciences; 02 engineering and technology; computer.software_genre; Task (project management); Set (abstract data type); Annotation; 0202 electrical engineering, electronic engineering, information engineering; Web application; 020201 artificial intelligence & image processing; 0501 psychology and cognitive sciences; Artificial intelligence; business; computer; Natural language processing
Source:	Communications in Computer and Information Science; ISBN: 9789811533792; ACIIDS (Companion)
DOI:	10.1007/978-981-15-3380-8_6
Description:	Any NLP system needs enough data for training and testing. This data can be split into two datasets: correct and incorrect (erroneous) data. A set of correct data is usually easy to find, because correct texts are available from many sources, although they may also contain some mistakes. Data containing errors such as typos, mistakes and misspellings, on the other hand, is hard to obtain. It usually comes from a lengthy manual process and requires human annotation. One way to get the incorrect dataset faster is to generate it, but this raises the problem of how to generate incorrect texts so that they correspond to real human mistakes. In this paper, we focus on obtaining the incorrect dataset with the help of humans. We created an automated web application (a game) that collects incorrect texts and misspellings from players for texts written in the Slovak language. Based on the obtained data, we built a model of common errors that can be used to generate a large amount of authentic-looking erroneous text.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::64f9ce78330feb31d7c4b17832ad4710 https://doi.org/10.1007/978-981-15-3380-8_6
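
Note: the description mentions a model of common errors, built from collected data, that can generate erroneous texts. The record does not say how that model is constructed, so the following is only a minimal sketch of one possible approach, not the authors' method. It assumes the collected data is available as (correct, typed) string pairs; the function names `learn_error_model` and `corrupt`, the `rate` parameter, and the two sample pairs are all hypothetical. It uses only `difflib.SequenceMatcher` from the Python standard library.

```python
import random
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def learn_error_model(pairs):
    """Count how each correct fragment was mistyped in (correct, typed) pairs.

    Hypothetical helper: the result maps a correct fragment to a Counter
    of the erroneous fragments observed in its place.
    """
    model = defaultdict(Counter)
    for correct, typed in pairs:
        matcher = SequenceMatcher(None, correct, typed, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Keep only replacements and deletions. Insertions have no
            # source fragment to key on in this simplified model.
            if tag in ("replace", "delete"):
                model[correct[i1:i2]][typed[j1:j2]] += 1
    return model


def corrupt(text, model, rate, rng):
    """Swap known fragments for an observed error with probability `rate`."""
    fragments = sorted(model, key=len, reverse=True)  # try longest first
    out = []
    i = 0
    while i < len(text):
        for frag in fragments:
            if text.startswith(frag, i) and rng.random() < rate:
                errors, weights = zip(*model[frag].items())
                # Sample an error in proportion to how often it was seen.
                out.append(rng.choices(errors, weights=weights)[0])
                i += len(frag)
                break
        else:
            # No fragment matched (or none was selected): copy the character.
            out.append(text[i])
            i += 1
    return "".join(out)


# Illustrative pairs only (missing Slovak diacritics); not data from the paper.
sample_pairs = [("ktorý", "ktory"), ("ďakujem", "dakujem")]
error_model = learn_error_model(sample_pairs)
print(corrupt("ďakujem, ktorý", error_model, rate=1.0, rng=random.Random(0)))
# prints "dakujem, ktory"
```

With `rate=1.0` every known fragment is replaced, which makes the output deterministic here. A lower rate would leave most of the text intact and produce sparser, more realistic-looking errors.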