Creation of an annotated corpus of Old and Middle Hungarian court records and private correspondence
Author: | Katalin Gugán, Adrienne Dömötör, Attila Novák, Mónika Varga |
---|---|
Year of publication: | 2017 |
Subject: | Normalization (statistics); Linguistics and Language; Interface (Java); Computer science; 02 engineering and technology; Library and Information Sciences; computer.software_genre; Language and Linguistics; Education; Annotation; 0202 electrical engineering, electronic engineering, information engineering; 060201 languages & linguistics; Parsing; Information retrieval; business.industry; Vernacular; Statistical model; 06 humanities and the arts; Metadata; 0602 languages and literature; 020201 artificial intelligence & image processing; Artificial intelligence; Computational linguistics; business; computer; Natural language processing |
Source: | Language Resources and Evaluation. 52:1-28 |
ISSN: | 1574-0218 1574-020X |
DOI: | 10.1007/s10579-017-9393-8 |
Description: | The paper introduces a novel annotated corpus of Old and Middle Hungarian (16th–18th centuries), the texts of which were selected to approximate the vernacular of the given historical periods as closely as possible. The corpus consists of witness testimonies from trials and samples of private correspondence. In addition to being morphologically analyzed, each file contains metadata that facilitates sociolinguistic research. The texts were segmented into clauses, manually normalized, and morphosyntactically annotated using an annotation system consisting of the PurePos PoS tagger and the Hungarian morphological analyzer HuMor, which was originally developed for Modern Hungarian but adapted to analyze Old and Middle Hungarian morphological constructions. The automatically disambiguated morphological annotation was manually checked and corrected using an easy-to-use web-based manual disambiguation interface. The normalization process and the manual validation of the annotation required extensive teamwork and provided continuous feedback for refining the computational morphology and iteratively retraining the statistical models of the tagger. The paper discusses some of the typical problems that occurred during the normalization procedure and their tentative solutions. We also describe the automatic annotation tools, the process of semi-automatic disambiguation, and the query interface, which includes a special function for correcting the annotation. Displaying the original, the normalized, and the parsed versions of the selected texts, the beta version of the first fully normalized and annotated historical corpus of Hungarian is freely accessible at http://tmk.nytud.hu/. |
Database: | OpenAIRE |