Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
Authors: | Isaac Caswell, Daan van Esch, Ankur Bapna, Theresa Breiner |
---|---|
Language: | English |
Year of publication: | 2020 |
Subject: |
FOS: Computer and information sciences
Computer Science - Computation and Language (cs.CL); Computer Science - Machine Learning (cs.LG); Natural language processing; Text corpus; Language identification |
Source: | COLING |
Description: | Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial dataset covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus. Accepted to COLING 2020. 9 pages with 8 page abstract |
Database: | OpenAIRE |
External link: |
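The abstract describes wordlist-based tunable-precision filters for cleaning LangID output. The following is a minimal illustrative sketch of that general idea, not the authors' released code: a sentence attributed to a language is kept only if the fraction of its tokens found in a curated wordlist for that language meets a tunable threshold. The function name, threshold value, and toy wordlist are all assumptions for illustration.

```python
# Hypothetical sketch of a wordlist-based tunable-precision filter.
# Raising `threshold` trades recall for precision: fewer sentences
# survive, but those that do are more likely to be in-language.

def wordlist_filter(sentences, wordlist, threshold=0.2):
    """Keep sentences whose in-wordlist token ratio meets the threshold."""
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if not tokens:
            continue  # skip empty lines
        in_list = sum(1 for t in tokens if t in wordlist)
        if in_list / len(tokens) >= threshold:
            kept.append(sentence)
    return kept

# Toy example: a tiny "curated wordlist" and two crawled candidates.
wordlist = {"the", "language", "corpus", "of", "a"}
candidates = [
    "the corpus of a language",   # all tokens in the list -> kept
    "xyzzy qwerty asdf",          # no tokens in the list -> dropped
]
print(wordlist_filter(candidates, wordlist, threshold=0.5))
# -> ['the corpus of a language']
```

Because the threshold is a single scalar, the precision/recall trade-off can be tuned per language, which matches the "tunable-precision" framing in the abstract.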