Exploiting Large Unlabeled Data in Automatic Evaluation of Coherence in Czech
Authors: | Kateřina Rysová, Michal Novák, Jiří Mírovský, Magdaléna Rysová |
---|---|
Publication year: | 2019 |
Subject: | Czech; Evaluation system; Exploit; business.industry; Computer science; media_common.quotation_subject; 05 social sciences; 050301 education; 02 engineering and technology; Coherence (statistics); Automated essay scoring; computer.software_genre; language.human_language; Margin (machine learning); 0202 electrical engineering, electronic engineering, information engineering; language; 020201 artificial intelligence & image processing; Quality (business); Artificial intelligence; Language model; business; 0503 education; computer; Natural language processing; media_common |
Source: | Text, Speech, and Dialogue (TSD); ISBN: 9783030279462 |
DOI: | 10.1007/978-3-030-27947-9_17 |
Description: | The paper contributes to research on the automatic evaluation of surface coherence in student essays. We look into the possibilities of using large unlabeled data to improve the quality of such evaluation. In particular, we propose two approaches to benefit from the large data: (i) an n-gram language model, and (ii) density estimates of the features used by the evaluation system. In our experiments, we integrate these approaches, which exploit data from the Czech National Corpus, into EVALD, the evaluator of surface coherence for Czech, and test its performance on two datasets: essays written by native speakers (L1) and essays written by foreign learners of Czech (L2). The system implementing these approaches, together with other new features, significantly outperforms the original EVALD system, with an especially large margin on L1. |
Database: | OpenAIRE |
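
The two ideas named in the description, an n-gram language model and density estimates of feature values, can be illustrated with a minimal Python sketch. This is not the EVALD implementation: the toy English corpus stands in for a large unlabeled corpus such as the Czech National Corpus, and the function names, the add-one smoothing, the bigram order, and the Gaussian kernel with a fixed bandwidth are all assumptions made for illustration only.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences (lists of tokens)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def avg_log_prob(tokens, unigrams, bigrams):
    """Average add-one-smoothed bigram log-probability per predicted token."""
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen tokens
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total / (len(padded) - 1)


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of a feature value from samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical toy data standing in for a large unlabeled corpus.
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
unigrams, bigrams = train_bigram_lm(corpus)

# Language-model score of an essay sentence, usable as one classifier feature.
seen = avg_log_prob(["the", "cat", "sleeps"], unigrams, bigrams)
shuffled = avg_log_prob(["sleeps", "the", "cat"], unigrams, bigrams)

# Density of a feature value relative to values observed in the unlabeled data.
density = gaussian_kde([0.0, 1.0, 2.0], bandwidth=0.5)
typical, unusual = density(1.0), density(5.0)
```

In this sketch, a sentence whose word order matches the unlabeled data receives a higher average log-probability than a shuffled one, and a feature value close to those seen in the unlabeled data receives a higher density than an outlying one. Either number could then be passed to a classifier as an additional feature.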