The Open corpus of the Veps and Karelian languages: overview and applications

Authors: Boyko, Tatyana; Zaitseva, Nina; Krizhanovskaya, Natalia; Krizhanovsky, Andrew; Novak, Irina; Pellinen, Nataliya; Rodionova, Aleksandra
Year of publication: 2022
Subject:
Source: KnE Social Sciences. 7 (3). 2022. P. 29-40
Document type: Working Paper
DOI: 10.18502/kss.v7i3.10419
Description: The methods and tools of corpus linguistics have become a growing priority in the study of the Baltic-Finnic languages of the Republic of Karelia. Since 2016, linguists, mathematicians, and programmers at the Karelian Research Centre have been working on the Open Corpus of the Veps and Karelian Languages (VepKar), an extension of the Veps Corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system that filters by text attributes (language, genre, etc.) and by numerous linguistic categories (lexical and grammatical search in texts was made possible by the word-form generator that we created earlier). A corpus of 3000 texts was compiled, the texts were uploaded and marked up, a system for classifying texts by language, dialect, type, and genre was introduced, and the word-form generator was created. Future plans include developing a speech module for working with audio recordings and a syntactic tagging module that uses the output of morphological analysis. Owing to continuous functional improvements to the corpus manager and the ongoing enrichment of VepKar with new material and text markup, users can handle a wide range of scientific and applied tasks. In creating the universal national VepKar corpus, its developers and managers strive to preserve and exhibit as fully as possible the state of the Veps and Karelian languages in the 19th-21st centuries.
Comment: 9 pages, 9 figures, published in the journal
Database: arXiv