Automatic lemmatization in Setswana: towards a prototype

Authors: Karien Brits, Gerhard B. van Huyssteen, Rigardt Pretorius
Year of publication: 2005
Subject:
Source: South African Journal of African Languages. 25:37-47
ISSN: 2305-1159, 0257-2117
DOI: 10.1080/02572117.2005.10587247
Description: Development of human language technologies for the indigenous South African languages is currently being undertaken in various projects across South Africa. In one such project a lemmatizer for Setswana is being developed, and this article reports on work towards the development of a first prototype. A prerequisite of lemmatization is to determine what the output of a lemmatizer for a specific language should be (i.e. what should be considered a lemma in that language). Consequently, the concept of a lemma as it should be understood in the context of Setswana lemmatization is defined, and it is indicated that only nouns and verbs really pose challenges for the lemmatization of Setswana. The computational approach taken in this research, and the implementation applied, which use FSA 6, are described at length. Preliminary results indicate that the rules for nouns and verbs are rather accurate, with precision scores of 93–94% obtained in a small, contained experiment. The article concludes with a discussion...
Database: OpenAIRE