Latviešu valodas apguvēju korpusa datu ieguves un apstrādes metodoloģijas izstrāde [Development of a methodology for data collection and processing for the Latvian learner corpus]

Authors: Ilze Auziņa, Kristīne Pokratniece, Kristīne Levāne-Petrova
Year of publication: 2020
Source: Valodu apguve: problēmas un perspektīva : zinātnisko rakstu krājums = Language Acquisition: Problems and Perspective : conference proceedings. :299-309
ISSN: 2661-5584
Description: The popularity of learning Latvian as a foreign language is increasing. Latvian as a foreign language is taught not only in the higher education institutions of Latvia, but also in more than 20 universities outside Latvia (Šalme 2008; Šalme 2011; Laizāne 2019). Therefore, corpus-based and corpus-driven teaching materials are crucial for international students who are learning Latvian both in Latvia and abroad. Since September 2018, the project Development of Learner Corpus of Latvian: methods, tools and applications has been carried out. During the project, building on existing experience in designing the Learner Corpus of Latvian (LaVA), a corpus of essays by students with different language backgrounds will be created. The newly developed corpus will be publicly available. Although the corpus creation pipeline includes text collection, digitization, and morphological and error annotation, this article covers only the first phases of corpus creation: the development of a methodology for data collection and digitization. An agreement form with the data subject on the inclusion of data in the corpus and a metadata collection form (gender, age, mother tongue, language proficiency level) were developed. Guidelines for teachers on preferred topics have been prepared. The corpus is built on an integrated multifunctional platform that provides a single interface for uploading, annotating and searching. Moreover, the web platform can also be used for storing scanned copies of essays, comparing texts entered by two independent digitizers, correcting texts, annotating errors, and calculating inter-annotator agreement. At least 1000 essays on different topics from students with different language backgrounds are planned to be included in the LaVA corpus.
For data collection, multiple universities and language teachers have been contacted and have agreed to support the corpus creation process by providing their students' previously written assignments. The collected essays and their metadata are handwritten; therefore, they need to be digitized for the further data processing steps. The digitization is carried out in three steps: 1) scanning the assignments and essays, 2) entering the metadata, 3) transcribing the text into digital format. Scanned images of the assignments help to validate the correctness of the data if any doubt arises. Metadata is entered manually.
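The comparison of texts entered by two independent digitizers, mentioned above, could be sketched roughly as follows. This is only an illustrative sketch using Python's standard `difflib` module; the function name `compare_transcriptions` and the word-level granularity are assumptions made for the example and do not describe how the LaVA platform is actually implemented.

```python
import difflib


def compare_transcriptions(text_a: str, text_b: str) -> list[tuple[str, str, str]]:
    """Return word-level disagreements between two transcriptions.

    Hypothetical helper: each item is (operation, words_from_a, words_from_b),
    where operation is 'replace', 'delete' or 'insert'.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False keeps frequent words from being treated as noise.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
            )
    return disagreements


if __name__ == "__main__":
    first = "Es dzīvoju Rīgā"
    second = "Es dzivoju Rīgā"
    for op, a, b in compare_transcriptions(first, second):
        print(f"{op}: digitizer A wrote '{a}', digitizer B wrote '{b}'")
```

Each reported disagreement could then be checked against the scanned image of the essay before the text is accepted into the corpus.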
Database: OpenAIRE