Identifying relations between characters in Afrikaans, Tshivenḓa, and Xitsonga books

Authors: Maxwell Ramukhadi, Phathushedzo; Mlambo, Respect; Trollip, Benito; van Zaanen, Menno
Language: English
Year of publication: 2020
Subject:
DOI: 10.17613/z8m1-dq32
Description: The usefulness of computational linguistic tools, such as named entity recognition (NER) systems, in linguistic or literary studies of under-resourced languages is still relatively unexplored. In this study the CTexTools2 NER system, which performs NER on all official South African languages, is applied to one Afrikaans novel and two scanned dramas, one in Tshivenḓa and one in Xitsonga. Next, personal relations are identified through character name co-occurrence in sentences, and these relationships are visualized using Gephi. The research identified several practical problems. First, the NCHLT Optical Character Recognition (OCR) system for South African languages was used to extract text from the Tshivenḓa and Xitsonga scans, but the resulting quality turned out to be relatively low. We expect that one of the reasons is that the language model of the OCR system was trained on limited amounts of data. Second, the quality of the NER was also low for all three languages, not only because of limited amounts of training data, but also because the recognizers were trained on government data, which typically contains few named entities. Third, we found that in Tshivenḓa there are different ways of addressing people, indicated by the prefix Vho-. For example, Vho-Tshibovhola (formal) and Tshibovhola (informal) refer to the same person. Finally, identifying the relationships between characters turned out to be difficult. All texts are relatively short, and hence not many sentences contain more than one character. The type of text also has an effect: whereas the Afrikaans novel contains some multi-character sentences, the Tshivenḓa and Xitsonga plays contain hardly any. This shows that this approach to relationship finding may not be well suited to plays. To try to resolve this issue, we incorporate the speaker as a character for each sentence, resulting in more extensive relationship information.
Database: OpenAIRE
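
Illustration: the description above outlines three steps that can be sketched in a few lines of Python: merging the formal Vho- form with the bare name, counting character pairs that share a sentence, and adding the speaker of a line as a character. This is a minimal sketch under stated assumptions, not the authors' implementation. The input format (a speaker plus the list of names found by NER), the function names, and the names Mukona and Lufuno are hypothetical; only Tshibovhola comes from the record. The CSV layout with Source, Target and Weight columns is one that Gephi's spreadsheet import is expected to accept as an edge table.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def normalize(name):
    """Map the formal form (Vho-X) to the bare name X (assumed rule)."""
    return name[4:] if name.startswith("Vho-") else name


def cooccurrence_edges(sentences):
    """Count character pairs that share a sentence.

    Each item is (speaker, names). speaker is None for narrative text;
    names is the list of character names found in that sentence.
    """
    edges = Counter()
    for speaker, names in sentences:
        present = {normalize(n) for n in names}
        if speaker is not None:
            # Treat the speaker of a line as a character in that sentence.
            present.add(normalize(speaker))
        for a, b in combinations(sorted(present), 2):
            edges[(a, b)] += 1
    return edges


def to_edge_csv(edges):
    """Serialize the edge counts as a Source,Target,Weight table."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])
    return buf.getvalue()


# Hypothetical input: two lines of dialogue and one narrative sentence.
sentences = [
    ("Vho-Tshibovhola", ["Mukona"]),
    (None, ["Tshibovhola", "Mukona", "Lufuno"]),
    ("Lufuno", []),
]
edges = cooccurrence_edges(sentences)
edge_csv = to_edge_csv(edges)
```

Without the speaker, the first and third items would contribute no pairs at all, which mirrors the problem the description reports for plays.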