A classification-based approach to the identification of Multiword Expressions (MWEs) in Magahi Applying SVM
Author: | Girish Nath Jha, Shivek Kumar, Pitambar Behera |
---|---|
Year of publication: | 2017 |
Subject: |
Hindi
0209 industrial biotechnology; business.industry; Computer science; Principle of compositionality; 02 engineering and technology; computer.software_genre; language.human_language; Support vector machine; Identification (information); Annotation; 020901 industrial engineering & automation; Classifier (linguistics); 0202 electrical engineering, electronic engineering, information engineering; language; General Earth and Planetary Sciences; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language; Natural language processing; Word (computer architecture); General Environmental Science |
Source: | KES |
ISSN: | 1877-0509 |
DOI: | 10.1016/j.procs.2017.08.059 |
Description: | Multiword expressions (MWEs) are crucial for any Natural Language Processing task because they occur frequently in every natural language. In addition, they “display a continuum of compositionality”. Although they are highly frequent in informal spoken corpora, they are used less frequently in formal written corpora. Multiword expressions in Magahi can provide a unique platform and a gateway to research into other less-resourced Indian languages in general and dialectal varieties of Hindi in particular. This is the first research project of its kind undertaken in Magahi. In this study, we apply a Support Vector Machine (SVM) classifier to the automatic identification and classification of multiword expressions. For this purpose, we use a POS-annotated corpus of approximately 75k word tokens, of which 11k tokens are multiword expressions. The raw data used in this study were crawled and sanitized by the Indian languages crawler known as IC Crawler and semi-automatically annotated with the ILCI annotation tool. The tagset followed for annotation comprises nine annotation labels adapted from Singh et al. The Magahi multiword extractor achieves a combined overall precision of 81.57%. |
Database: | OpenAIRE |
External link: |