Measuring verb similarity using binary coefficients with application to isiXhosa and isiZulu

Authors: C. Maria Keet, Zola Mahlaza
Year of publication: 2018
Subject:
Source: SAICSIT
DOI: 10.1145/3278681.3278690
Description: Natural Language Processing (NLP) for under-resourced languages may benefit from a bootstrapping approach that utilises the sparse resources available across closely related languages. This brings to the fore the question of language similarity, and with it the question of how to measure that similarity, so as to make informed predictions about the potential success of bootstrapping. We present a method for measuring morphosyntactic similarity by developing Context Free Grammars (CFGs) for isiXhosa and isiZulu verb fragments that are relevant for the use case of weather forecast generation. We then investigate the morphosyntactic similarity of the CFGs using parse tree analysis and four binary similarity measures. In particular, we selected four binary similarity measures from other domains and adapted them to our data, which are the word sets generated from the respective CFGs. The similarity measures, together with the parse tree analysis, are used to study the extent to which both languages can be generated by a single grammar fragment. The resulting 52 rules for isiXhosa and 49 rules for isiZulu overlap on 42 rules. This supports the existing intuition of similarity, as the two languages are in the same language cluster. The morphosyntactic similarity measured with the binary coefficients reached 59.5% overall (adapted Driver-Kroeber), with 99.5% for the past tense only. This score is lower than the structural overlap of the CFGs suggests, which is attributable to small differences in the terminals, mainly in the prefix of the verb. The parse tree analysis and binary similarity measures show that a modularised set of rules for the prefix, verb root, and suffix would allow the generation of the two languages with a single grammar in which only the prefix requires differentiation.
Database: OpenAIRE
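
The abstract describes comparing two sets with binary similarity coefficients. A minimal sketch of that idea follows, with several assumptions. The record does not say which four measures were used or how they were adapted, so the Jaccard and Sørensen-Dice coefficients here are standard stand-ins rather than the authors' method. The rule identifiers are invented placeholders. The coefficients are applied to the reported rule counts (52, 49, 42 shared) purely for illustration, whereas the paper applies its measures to the word sets generated by the CFGs, so the output is not expected to match the reported 59.5%.

```python
def overlap_counts(x, y):
    """Return (a, b, c): items in both sets, only in x, only in y."""
    a = len(x & y)
    b = len(x - y)
    c = len(y - x)
    return a, b, c


def jaccard(a, b, c):
    """Jaccard coefficient: shared items over all distinct items."""
    total = a + b + c
    return a / total if total else 0.0


def dice(a, b, c):
    """Sorensen-Dice coefficient: shared items weighted twice."""
    total = 2 * a + b + c
    return 2 * a / total if total else 0.0


# Placeholder rule identifiers (hypothetical): 42 shared rules,
# 10 only in the isiXhosa grammar, 7 only in the isiZulu grammar.
shared = {f"shared_{i}" for i in range(42)}
xhosa_rules = shared | {f"xh_{i}" for i in range(10)}
zulu_rules = shared | {f"zu_{i}" for i in range(7)}

a, b, c = overlap_counts(xhosa_rules, zulu_rules)
print(f"a={a}, b={b}, c={c}")
print(f"Jaccard: {jaccard(a, b, c):.3f}")
print(f"Dice:    {dice(a, b, c):.3f}")
```

Both coefficients ignore items absent from both sets, which suits this setting because there is no natural universe of "all possible rules" against which to count joint absences.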