A classification-based approach to the identification of Multiword Expressions (MWEs) in Magahi Applying SVM

Author: Girish Nath Jha, Shivek Kumar, Pitambar Behera
Year of publication: 2017
Subject:
Source: KES
ISSN: 1877-0509
DOI: 10.1016/j.procs.2017.08.059
Description: Multiword expressions are crucial for any Natural Language Processing task because they occur frequently in every natural language. In addition, they “display a continuum of compositionality”. Although they are highly frequent in informal spoken corpora, they are used less frequently in formal textual corpora. Multiword expressions in Magahi can provide a unique platform and a gateway to research into other less-resourced Indian languages in general and dialectal varieties of Hindi in particular. This is the very first research project of its kind undertaken in Magahi. In this study, we have applied a Support Vector Machine classifier for the automatic identification and classification of multiword expressions. For this purpose, we have used a POS-annotated corpus of approximately 75k word tokens, of which 11k tokens are multiword expressions. The raw data used in this study were crawled and sanitized by an Indian languages crawler known as IC Crawler and semi-automatically annotated with the ILCI annotation tool. The tagset followed for annotation comprises nine annotation labels adapted from Singh et al. The Magahi multiword extractor achieves a combined overall precision accuracy of 81.57%.
Database: OpenAIRE