Updated Morphologically Annotated Corpora for 9 South African Languages

Authors: Tanja Gaustad, Cindy A. McKellar
Language: English
Year of publication: 2024
Subject:
Source: Journal of Open Humanities Data, Vol 10, Pp 38-38 (2024)
Document type: article
ISSN: 2059-481X
DOI: 10.5334/johd.211
Description: The dataset described in this article presents converted and updated corpora for nine of the twelve official South African languages. After a revision of the morphological annotation protocols, the existing National Centre for Human Language Technology (NCHLT) corpora (Eiselen & Puttkammer, 2014) were converted to the updated morphological tags and subsequently checked by linguistic experts for correctness. The resulting corpora are uniformly annotated for morphology across all nine languages, amounting to approximately 70,000 tokens for the five disjunctively written languages and 45,000 tokens for the four conjunctively written languages. The corpora are primarily aimed at the development and evaluation of Natural Language Processing (NLP) core technologies. In addition, the data can be used for language-specific and cross-language comparative corpus linguistic studies, as well as for corpus-based investigations of morphological phenomena in the included languages.
Database: Directory of Open Access Journals