Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise
Author: | Zosa, Elaine; Mutuvi, Stephen; Granroth-Wilding, Mark; Doucet, Antoine |
---|---|
Contributors: | Ke, Hao-Ren; Lee, Chei Sian; Sugiyama, Kazunari; Department of Computer Science; Discovery Research Group/Prof. Hannu Toivonen; University of Helsinki; Laboratoire Informatique, Image et Interaction - EA 2118 (L3I); Université de La Rochelle (ULR); Multimedia University (MMU) |
Year of publication: | 2021 |
Subject: |
Topic model; word embeddings; Computer science; Speech recognition; 02 engineering and technology; topic modelling; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; OCR noise; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; Robustness (computer science); 0202 electrical engineering electronic engineering information engineering; [INFO.INFO-DL]Computer Science [cs]/Digital Libraries [cs.DL]; [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]; 060201 languages & linguistics; 06 humanities and the arts; 113 Computer and information sciences; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Noise; OCR; ComputingMethodologies_PATTERNRECOGNITION; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; 0602 languages and literature; ComputerApplications_GENERAL; Embedding; 020201 artificial intelligence & image processing |
Source: | International Conference on Asian Digital Libraries (ICADL). Hao-Ren Ke; Chei Sian Lee; Kazunari Sugiyama. Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, 13133, Springer, pp. 392-400, 2021, Lecture Notes in Computer Science, ISBN: 978-3-030-91668-8. ⟨10.1007/978-3-030-91669-5_30⟩ |
ISSN: | 0302-9743; 1611-3349 |
DOI: | 10.1007/978-3-030-91669-5_30 |
Description: | Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools to analyse digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper, we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy. |
Database: | OpenAIRE |
External link: |