Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Authors: Zosa, Elaine, Mutuvi, Stephen, Granroth-Wilding, Mark, Doucet, Antoine
Contributors: Ke, Hao-Ren, Lee, Chei Sian, Sugiyama, Kazunari, Department of Computer Science, Discovery Research Group/Prof. Hannu Toivonen, University of Helsinki, Laboratoire Informatique, Image et Interaction - EA 2118 (L3I), Université de La Rochelle (ULR), Multimedia University (MMU), Hao-Ren Ke, Chei Sian Lee, Kazunari Sugiyama
Publication year: 2021
Subject:
Topic model
word embeddings
Computer science
Speech recognition
02 engineering and technology
topic modelling
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
OCR noise
[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]
Robustness (computer science)
0202 electrical engineering
electronic engineering
information engineering

[INFO.INFO-DL]Computer Science [cs]/Digital Libraries [cs.DL]
[INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC]
060201 languages & linguistics
06 humanities and the arts
113 Computer and information sciences
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing
Noise
OCR
ComputingMethodologies_PATTERNRECOGNITION
[INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]
0602 languages and literature
ComputerApplications_GENERAL
Embedding
020201 artificial intelligence & image processing
Source: International Conference on Asian Digital Libraries (ICADL)
Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings
Hao-Ren Ke; Chei Sian Lee; Kazunari Sugiyama. Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, 13133, Springer, pp.392-400, 2021, Lecture Notes in Computer Science, 978-3-030-91668-8. ⟨10.1007/978-3-030-91669-5_30⟩
Lecture Notes in Computer Science ISBN: 9783030916688
ISSN: 0302-9743
1611-3349
DOI: 10.1007/978-3-030-91669-5_30
Description: Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools to analyse digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy.
Database: OpenAIRE