Bootstrapping Techniques for Polysynthetic Morphological Analysis
Authors: | William S. Lane, Steven Bird |
---|---|
Language: | English |
Year of publication: | 2020 |
Subject: |
FOS: Computer and information sciences
Computer Science - Computation and Language; Computer science; business.industry; Bootstrapping (linguistics); 02 engineering and technology; computer.software_genre; 03 medical and health sciences; 0302 clinical medicine; Morpheme; Polysynthetic language; Australian language; 030221 ophthalmology & optometry; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Computation and Language (cs.CL); Word (computer architecture); Natural language processing; Natural language |
Source: | ACL |
Description: | Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer, and demonstrate its application to Kunwinjku, a polysynthetic Australian language. We generate data from a finite state transducer to train an encoder-decoder model. We improve the model by “hallucinating” missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes. The best model accounts for all instances of reduplication in the test set and achieves an accuracy of 94.7% overall, a 10 percentage point improvement over the FST baseline. This process demonstrates the feasibility of bootstrapping a neural morph analyzer from minimal resources. |
Database: | OpenAIRE |
External link: |
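
The abstract mentions resampling from a Zipf distribution so that generated training data has a more natural morpheme distribution. The record does not say how the authors implement this, so the following is only a minimal sketch of the general idea under stated assumptions: the function names `zipf_weights` and `zipf_resample`, the `exponent` parameter, and the placeholder stem labels are all hypothetical, and the ranking is simply taken from list order.

```python
import random
from collections import Counter


def zipf_weights(n, exponent=1.0):
    """Return normalized weights for ranks 1..n, proportional to 1 / rank**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def zipf_resample(items, k, exponent=1.0, seed=0):
    """Draw k items with replacement.

    Assumption: items earlier in the list are treated as higher-ranked
    (more frequent); the source does not specify how ranks are assigned.
    """
    rng = random.Random(seed)
    weights = zipf_weights(len(items), exponent)
    return rng.choices(items, weights=weights, k=k)


# Hypothetical placeholder labels, not real Kunwinjku morphemes.
stems = ["stem_a", "stem_b", "stem_c", "stem_d"]
sample = zipf_resample(stems, k=1000)
counts = Counter(sample)
```

In a pipeline like the one described, each drawn item would presumably be passed to the generator in place of a uniformly chosen one, so that a few items dominate the training data and most appear rarely; that coupling is likewise an assumption, not something stated in this record.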