Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew
Author: | Aynat Rubinstein |
---|---|
Publication year: | 2019 |
Subject: |
050101 languages & linguistics
Linguistics and Language; Markup language; Hebrew; Computer science; business.industry; 05 social sciences; Hebrew literature; 02 engineering and technology; Library and Information Sciences; Crowdsourcing; Language and Linguistics; language.human_language; Linguistics; Education; Metadata; Digital humanities; 0202 electrical engineering, electronic engineering, information engineering; language; 020201 artificial intelligence & image processing; 0501 psychology and cognitive sciences; Language model; Computational linguistics; business |
Source: | Language Resources and Evaluation. 53:807-835 |
ISSN: | 1574-0218 1574-020X |
DOI: | 10.1007/s10579-019-09458-4 |
Description: | The paper describes the creation of the first open access multi-genre historical corpus of Emergent Modern Hebrew, made possible by the implementation of digital humanities methods in corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open access online repository of Hebrew literature, and in digital images curated from the collections of the National Library of Israel, a selection of which have been transcribed through a dedicated crowdsourcing task that feeds back into the library’s online catalog. Texts in the corpus are encoded following best practices in the digital humanities, including markup of metadata that enables time-sensitive research on the corpus, linguistic and otherwise. Evaluation of morphological analysis based on Modern Hebrew language models is shown to distinguish between genres in the historical variety, highlighting the importance of ephemeral materials for linguistic research and for potential collaboration with libraries and cultural institutions in the process of corpus creation. We demonstrate the use of the corpus in diachronic linguistic research and suggest ways in which the association it provides between digital images and texts can be used to support automatic language processing and to enhance resources in the digital humanities. |
Database: | OpenAIRE |