Mending Fractured Texts: A Heuristic Procedure for Correcting OCR Data

Authors: Philip Diderichsen, Jens Bjerring-Hansen, Ross Deans Kristensen-McLachlan, Dorte Haltrup Hansen
Language: English
Year of publication: 2022
Subject:
Source: Bjerring-Hansen, J, Deans Kristensen-McLachlan, R, Diderichsen, P & Hansen, D H 2022, Mending Fractured Texts: A Heuristic Procedure for Correcting OCR data. In Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022. Uppsala, DHNB PROCEEDINGS, no. 3232, pp. 177-186. <http://ceur-ws.org/Vol-3232/paper14.pdf>
Aarhus University
Bjerring-Hansen, J, Kristensen-McLachlan, R D, Diderichsen, P & Hansen, D H 2022, 'Mending Fractured Texts. A heuristic procedure for correcting OCR data', CEUR Workshop Proceedings, vol. 3232, pp. 177-186. <https://ceur-ws.org/Vol-3232/paper14.pdf>
Diderichsen, P, Bjerring-Hansen, J, Kristensen-McLachlan, R D & Haltrup Hansen, D 2022, 'Mending Fractured Texts. A heuristic procedure for correcting OCR data', Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, 15/03/2022-18/03/2022.
Description: In this paper we present an evaluation pipeline comparing different methods of Optical Character Recognition (OCR) of 19th century printed fraktur (gothic/blackletter), as well as a correction pipeline that combines re-OCRing and language technology. The work has been carried out at the University of Copenhagen in relation to a research project involving digital explorations of a corpus of some 900 Danish and Norwegian novels from 1870 to 1899, totalling approximately 50 million words. Roughly 25% of these are printed in the traditional fraktur font, which was almost completely dominant at the beginning of the 19th century. These texts are culturally important, since they represent mostly forgotten, popular novels; however, they pose technical and methodological challenges in terms of processing the text from printed page to digital corpus. In order to provide the best possible material for digital literary analysis as well as more linguistic studies, we designed a handcrafted OCR correction pipeline for the fraktur part of the corpus, consisting of several different heuristic correction steps, with reference to a gold standard. The first step is a preprocessing step which takes care of obvious and unambiguous OCR errors. In the second step, we align our primary OCR output candidate (the output from Tesseract using the “Fraktur.traineddata” pretrained OCR model) with several other OCR output candidates and perform context-sensitive correction with reference to these. In particular, the Danish “æ” and “ø” characters can be successfully recovered with reference to the Danish, non-fraktur “dan.traineddata” Tesseract model. Finally, in the third step, we employ the SymSpell algorithm (https://github.com/wolfgarbe/SymSpell) to perform spelling correction backed by a word form dictionary hand-crafted from various relevant sources. The pipeline yields an improvement in word error rate from about 11% (89% correctly recognized word forms) to about 3% (97% correctly recognized word forms).
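The word error rate figures quoted in the description can be illustrated with a minimal sketch. It assumes the common definition of word error rate as word-level edit distance divided by the number of words in the gold standard; the record does not state how the authors computed their figures, so this may differ from their measure. The function name `word_error_rate` and the sample sentences are hypothetical and not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: this is the standard WER definition, not necessarily
    the exact measure used by the authors.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample: a gold-standard line and an OCR line with two
# misrecognized word forms ("mork" for "mørk", "ftormfuld" for "stormfuld").
gold = "det var en mørk og stormfuld nat"
ocr = "det var en mork og ftormfuld nat"
wer = word_error_rate(gold, ocr)
print(f"WER: {wer:.1%}")
```

Under this definition, a drop from about 11% to about 3% means that roughly 8 more word forms per 100 match the gold standard after correction.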
Database: OpenAIRE