AN ALGORITHM FOR MATCHING OCR-GENERATED TEXT STRINGS

Authors: Junichi Kanai, Stephen V. Rice, Thomas A. Nartker
Publication year: 1994
Subject:
Source: Document Image Analysis
ISSN: 1793-6381, 0218-0014
DOI: 10.1142/s0218001494000632
Description: When optical character recognition (OCR) devices process the same page image, they generate similar text strings. Differences are due to recognition errors. A page of text rarely contains long repeated substrings; therefore, N strings generated by OCR devices can be quickly matched by detecting long common substrings. An algorithm for matching an arbitrary number of strings based on this principle is presented. Although its worst-case performance is O(Nn^2), its performance in practice has been observed to be O(Nn log n), where n is the length of a string. This algorithm has been successfully used to study OCR errors, to determine the accuracy of OCR devices, and to implement a voting algorithm.
Database: OpenAIRE
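
Illustration: the principle stated in the abstract (anchor on a long common substring, then match what remains on either side) can be sketched for the two-string case as follows. This is not the authors' algorithm, which handles an arbitrary number of strings; it is a minimal pairwise sketch that assumes Python's standard difflib module is an acceptable stand-in for the substring search. The function name match_strings and the parameter min_len are hypothetical.

```python
from difflib import SequenceMatcher


def match_strings(a: str, b: str, min_len: int = 1):
    """Return sorted (i, j, size) blocks with a[i:i+size] == b[j:j+size].

    Assumption: blocks are found greedily, longest common substring first,
    then the regions to its left and right are processed the same way.
    """
    sm = SequenceMatcher(None, a, b, autojunk=False)
    blocks = []
    # Explicit stack of (a_lo, a_hi, b_lo, b_hi) regions still to be matched.
    stack = [(0, len(a), 0, len(b))]
    while stack:
        alo, ahi, blo, bhi = stack.pop()
        m = sm.find_longest_match(alo, ahi, blo, bhi)
        if m.size == 0 or m.size < min_len:
            continue  # nothing long enough in this region
        blocks.append((m.a, m.b, m.size))
        stack.append((alo, m.a, blo, m.b))                    # left of the anchor
        stack.append((m.a + m.size, ahi, m.b + m.size, bhi))  # right of the anchor
    return sorted(blocks)


if __name__ == "__main__":
    # Two hypothetical OCR outputs of the same line of text.
    for i, j, size in match_strings("The quick brown fox", "The qu1ck brovvn fox"):
        print(i, j, repr("The quick brown fox"[i:i + size]))
```

The gaps between consecutive blocks are the candidate recognition errors (here "i" against "1" and "w" against "vv"). Raising min_len restricts matching to longer anchors, which is closer in spirit to the "long common substrings" the abstract relies on.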