Indigenous language technology in the age of machine learning.

Authors: Moshagen, Sjur Nørstebø; Antonsen, Lene; Wiechetek, Linda; Trosterud, Trond
Subject:
Source: Acta Borealia; 2024, Vol. 41 Issue 2, p102-116, 15p
Abstract: Most modern language technology for proofing tools, machine translation and other applications is based on machine learning. However, very few Indigenous languages have the necessary amount of texts for making tools based on this technology. When most language technology is based on large language models (LLMs), it bears the risk of most of Indigenous language online text being produced by neural text generation. The result would be that online texts cannot be trusted as a source for authentic Indigenous languages anymore. An alternative is the work done at UiT – The Arctic University of Norway during the last 20 years, based on linguistics. Sámi language tools have been made available for both industry and language communities, with open licenses. These have been widely used by translators, teachers and various software companies. The article analyzes the following four parts of language technology development: language data, language tool development, making the tools available to users, and ethical use of available language technology tools. We make extensive use of the CARE principles, and discuss the shortcomings of existing software and data licensing schemes. Finally, we introduce a 3D table to help classify language technology projects with respect to their suitability for Indigenous languages. [ABSTRACT FROM AUTHOR]
Database: Complementary Index