Unsupervised morphological analysis of minority languages with NPYLM: –Consideration for situations where training data is too small–

Authors: Haruhiko Takase, Katsuko Tomotsugu, Shinya Matsushita, Toshiaki Takano
Year of publication: 2021
Subject:
Source: 2021 Joint 10th International Conference on Informatics, Electronics & Vision (ICIEV) and 2021 5th International Conference on Imaging, Vision & Pattern Recognition (icIVPR).
DOI: 10.1109/icievicivpr52578.2021.9564146
Description: In this study, we propose a method based on the Nested Pitman-Yor Language Model (NPYLM) to support linguists' archival work by segmenting speech information into words. Because data and prior linguistic knowledge are limited for minority languages, the training data available for unsupervised morphological analysis is too small to construct an NPYLM effectively. We propose two methods to improve the accuracy of NPYLM-based analysis on small data. The first replaces all words obtained in the previous steps with different symbols; the second replaces only the uncommon words, identified by TF-IDF, with other symbols. Our experiments show that both methods are effective. We therefore confirm that unsupervised morphological analysis with NPYLM can support the segmentation of speech information into word units even when little data is available.
Database: OpenAIRE
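
The second method in the description, replacing only uncommon words chosen by TF-IDF with symbols, could look roughly like the sketch below. The record does not give the details, so everything here is an assumption made for illustration: each already-segmented utterance is treated as one document, a word's score is its maximum TF-IDF over all utterances, words at or above a chosen threshold count as "uncommon", and each such word is mapped to one character from the Unicode Private Use Area. The names `max_tfidf`, `replace_uncommon`, `threshold`, and `first_symbol` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def max_tfidf(utterances):
    """Return each word's highest TF-IDF score over all utterances.

    Assumption: every utterance (a list of words) is one "document".
    """
    n_docs = len(utterances)
    doc_freq = Counter()
    for words in utterances:
        doc_freq.update(set(words))  # count each word once per utterance

    scores = {}
    for words in utterances:
        if not words:
            continue
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = max(scores.get(word, 0.0), tf * idf)
    return scores


def replace_uncommon(utterances, threshold, first_symbol=0xE000):
    """Replace words scoring >= threshold with single placeholder symbols.

    Assumption: placeholders are taken from the Unicode Private Use Area,
    one distinct symbol per replaced word.
    """
    scores = max_tfidf(utterances)
    uncommon = sorted(w for w, s in scores.items() if s >= threshold)
    mapping = {w: chr(first_symbol + i) for i, w in enumerate(uncommon)}
    replaced = [[mapping.get(w, w) for w in words] for words in utterances]
    return replaced, mapping
```

Under these assumptions, the rewritten utterances would be passed back to the segmenter for another training pass, and `mapping` could be inverted afterwards to restore the original words. The threshold is a free parameter here; the record does not state how the authors selected uncommon words.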