User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization
Author: | Higashiyama, Shohei; Utiyama, Masao; Watanabe, Taro; Sumita, Eiichiro |
---|---|
Year of publication: | 2021 |
Subject: | |
Document type: | Working Paper |
Description: | Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT. Comment: NAACL-HLT 2021 |
Database: | arXiv |
External link: |