On OCR ground truths and OCR post-correction gold standards, tools and formats

Author: Reynaert, M., Antonacopoulos, A., Schulz, K.U.
Contributors: Antonacopoulos, A., Schulz, K.U., Creative Computing
Publication year: 2014
Subject:
Source: Antonacopoulos, A.; Schulz, K.U. (ed.), Proceedings of Digital Access to Textual Cultural Heritage (DATeCH 2014), Biblioteca Nacional de España, Madrid, pp. 159-166. New York: ACM
Description: We give an overview of activities undertaken on the sidelines of our core business, automatic OCR post-correction, over the past few years. We present ongoing projects in the Netherlands in which Text-Induced Corpus Clean-up plays a part. We describe the infrastructure we are building to help improve the overall text quality of large digitized text collections. We provide information on the tools we develop to facilitate the process and discuss the role of FoLiA XML, which we adopted as a pivot format. Connecting the dots, we discuss the difference we perceive between OCR ground truths and OCR post-correction gold standards, and their respective contributions.
Database: OpenAIRE