Arabic part-of-speech tagging using a combined rule-based and data-driven approach.

Author: Sabtan, Yasser
Subject:
Source: Digital Scholarship in the Humanities; Sep 2021, Vol. 36, Issue 3, pp. 719-735 (17 pages)
Abstract: This article presents a hybrid approach to part-of-speech tagging for undiacritized (or unvocalized) Arabic text which avoids the need for a large training set of manually tagged material. The approach is hybrid as it combines both rule-based and statistical (or data-driven) techniques. The key idea is that the training data are obtained by applying a rule-based tagger to a corpus of diacritized (or vocalized) text; a small subset of the output of the rule-based tagger is then hand-corrected, which is a much easier task than annotating it from scratch, and the results of this process are used as the training data for tagging undiacritized text. The advantage of this approach is that it requires very little manual effort. The only manual intervention is in the correction of the original training set. The accuracy obtained with this method is comparable to other state-of-the-art taggers for Arabic. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
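
Illustrative note: the pipeline described in the abstract (rule-based tagging of diacritized text, hand-correction of a small subset, then training on the undiacritized forms) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the article's implementation: the suffix rules, the tag names, the `corrections` mapping, and all function names are hypothetical, and a most-frequent-tag unigram model stands in for whatever statistical tagger the article actually uses, which the abstract does not specify.

```python
import re
from collections import Counter, defaultdict

# Arabic diacritic marks: fathatan (U+064B) through sukun (U+0652), plus superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"                    # "kataba", he wrote
WALADUN = "\u0648\u064E\u0644\u064E\u062F\u064C"                   # "waladun", a boy
AD_DARSA = "\u0627\u0644\u062F\u0651\u064E\u0631\u0652\u0633\u064E"  # "ad-darsa", the lesson


def strip_diacritics(word):
    """Return the undiacritized (unvocalized) form of a word."""
    return DIACRITICS.sub("", word)


def rule_based_tag(word):
    """Toy stand-in for a rule-based tagger over diacritized words.

    The rules are hypothetical and deliberately crude.
    """
    if word.endswith(("\u064B", "\u064C", "\u064D")):
        return "NOUN"  # tanween ending: treat as an indefinite nominal
    if word.endswith("\u064E"):
        return "VERB"  # final fatha: guess a perfective verb (often wrong)
    return "PART"


def build_training_data(diacritized_tokens, corrections):
    """Tag diacritized tokens by rule, apply hand corrections, strip diacritics.

    `corrections` maps a token index to the manually corrected tag.
    """
    data = []
    for i, word in enumerate(diacritized_tokens):
        tag = corrections.get(i, rule_based_tag(word))
        data.append((strip_diacritics(word), tag))
    return data


def train_unigram(data):
    """Most-frequent-tag model over undiacritized word forms."""
    counts = defaultdict(Counter)
    for word, tag in data:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, tokens, default="NOUN"):
    """Tag undiacritized tokens, falling back to `default` for unseen words."""
    return [(w, model.get(w, default)) for w in tokens]


tokens = [KATABA, WALADUN, AD_DARSA]
corrections = {2: "NOUN"}  # the fatha rule mislabels the accusative noun as a verb
model = train_unigram(build_training_data(tokens, corrections))
```

In this sketch the only manual input is the `corrections` mapping, which mirrors the abstract's point that correcting rule-based output is cheaper than annotating from scratch.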