On OCR ground truths and OCR post-correction gold standards, tools and formats
Author: | Reynaert, M., Antonacopoulos, A., Schulz, K.U. |
---|---|
Contributors: | Antonacopoulos, A., Schulz, K.U., Creative Computing |
Year of publication: | 2014 |
Subject: | Ground truth; Core business; Process (engineering); Computer science; 020206 networking & telecommunications; 02 engineering and technology; Tools/Integration. Digitale productiestraat; World Wide Web; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Quality (business); Language & Speech Technology; GeneralLiterature_REFERENCE (e.g. dictionaries, encyclopedias, glossaries); XML; Aligned constructions in machine translation |
Source: | Antonacopoulos, A.; Schulz, K.U. (ed.), Proceedings of Digital Access to Textual Cultural Heritage, Datech 2014, Biblioteca Nacional de España, Madrid, pp. 159-166. New York: ACM |
Description: | We give an overview of activities undertaken on the sidelines of our automatic OCR post-correction core business over the past few years. We present ongoing projects in the Netherlands in which Text-Induced Corpus Clean-up plays a part. We describe the infrastructure we are building to help improve the overall text quality of large digitized text collections. We provide information on the tools we develop to facilitate the process and discuss the role of FoLiA XML, which we adopted as a pivot format. Connecting the dots, we discuss the difference we perceive between OCR ground truths and OCR post-correction gold standards and their respective contributions. |
Database: | OpenAIRE |