The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek
Author: | Alek Keersmaekers |
---|---|
Year of publication: | 2021 |
Subject: | History; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; Ambiguity; Ancient Greek; Linguistics; Term (time); Greek language; Variation (linguistics); ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Encoding (semiotics); Architecture |
Source: | Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021. |
DOI: | 10.18653/v1/2021.lchange-1.6 |
Description: | This paper describes the GLAUx project (“the Greek Language Automated”), an ongoing effort to develop a large, long-term diachronic corpus of Greek, covering sixteen centuries of literary and non-literary material annotated with NLP methods. After providing an overview of related corpus projects and discussing the general architecture of the corpus, it zooms in on a number of larger methodological issues in the design of historical corpora. These include the encoding of textual variants, the handling of extralinguistic variation, and the annotation of linguistic ambiguity. Finally, the long- and short-term perspectives of the project are discussed. |
Database: | OpenAIRE |