Data Collection for Natural Language Processing Systems
Authors: | Emil Krsak, Michal Ďuračík, Stefan Toth, Miroslava Mikusova, Patrik Hrkut, Matej Mesko |
---|---|
Year of publication: | 2020 |
Subject: |
050101 languages & linguistics; Data collection; Computer science; business.industry; Process (engineering); 05 social sciences; 02 engineering and technology; computer.software_genre; Task (project management); Set (abstract data type); Annotation; 0202 electrical engineering, electronic engineering, information engineering; Web application; 020201 artificial intelligence & image processing; 0501 psychology and cognitive sciences; Artificial intelligence; business; computer; Natural language processing |
Source: | Communications in Computer and Information Science; ISBN: 9789811533792; ACIIDS (Companion) |
DOI: | 10.1007/978-981-15-3380-8_6 |
Description: | Any NLP system needs enough data for training and testing. This data can be split into two datasets: correct and incorrect (erroneous) data. Obtaining a set of correct data is usually not a problem, because correct texts are available from many sources, although they may still contain some mistakes. Obtaining data that contains errors such as typos, mistakes and misspellings, on the other hand, is a hard task. This kind of data is usually gathered through a lengthy manual process and requires annotation by humans. One way to get the incorrect dataset faster is to generate it. However, this raises the problem of how to generate incorrect texts so that they correspond to real human mistakes. In this paper, we focus on obtaining the incorrect dataset with the help of humans. We created an automated web application (a game) that collects incorrect texts and misspellings from players for texts written in the Slovak language. Based on the obtained data, we built a model of common errors that can be used to generate a large amount of authentic-looking erroneous text. |
Database: | OpenAIRE |
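
The description mentions a model of common errors used to generate erroneous text, but this record does not say how that model is built. The following is only a minimal sketch of one possible approach, assuming the collected data can be reduced to (correct, erroneous) word pairs of equal length. The function names `learn_substitutions` and `corrupt`, the sample pairs, and the restriction to single-character substitutions are all illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter, defaultdict


def learn_substitutions(pairs):
    """Count character substitutions seen in (correct, erroneous) pairs.

    Assumption: only pairs of equal length are used, so insertions and
    deletions are ignored in this simplified model.
    """
    model = defaultdict(Counter)
    for correct, wrong in pairs:
        if len(correct) != len(wrong):
            continue  # skip pairs that would need alignment
        for c, w in zip(correct, wrong):
            if c != w:
                model[c][w] += 1
    return model


def corrupt(text, model, rate, rng):
    """Replace characters that have observed errors with probability `rate`.

    The replacement is sampled in proportion to how often it was observed.
    """
    out = []
    for ch in text:
        options = model.get(ch)
        if options and rng.random() < rate:
            chars = list(options.keys())
            weights = list(options.values())
            out.append(rng.choices(chars, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical sample pairs for illustration; not data from the paper.
pairs = [("býk", "bík"), ("ryba", "riba"), ("mäso", "meso")]
model = learn_substitutions(pairs)
noisy = corrupt("ryba a býk", model, rate=0.5, rng=random.Random(0))
```

Sampling replacements by observed frequency keeps the generated mistakes close to those people actually made, which is the property the description asks for. A fuller model would also need to cover inserted, deleted and transposed characters.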