Construction and annotation of a corpus of contemporary Nepali

Authors:	Pat Hall, Tony McEnery, Andrew Hardie, Ram Raj Lohani, Amar Gurung, Bhim Narayan Regmi, Srishtee Gurung, Yogendra P. Yadava, Jens Allwood
Publication year:	2008
Subject:	Text corpus; Linguistics and Language; Nepali; Computer science; Speech corpus; Unicode; Language and Linguistics; Linguistics; Annotation; Corpus linguistics; Artificial intelligence; XML; Natural language processing; Spoken language
Source:	Corpora. 3:213-225
ISSN:	1755-1676 1749-5032
DOI:	10.3366/e1749503208000166
Description:	In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::6e45a17a5a2630b380c09ca322df6dc6 https://doi.org/10.3366/e1749503208000166