Content-based text line comparison for historical document retrieval

Author: Zinger, S., Nerbonne, J., Schomaker, L., Schie, H., Nicolov, N., Angelova, G., Mitkov, R.
Contributors: Signal Processing Systems, Biomedical Diagnostics Lab
Language: English
Year of publication: 2007
Subject:
Source: Proceedings of the Recent Advances in Natural Language Processing Conference, RANLP-2007, 27-29 September 2007, Borovets, Bulgaria, 79-84
Description: In the historical handwritten document retrieval system that we are currently building, each element of the training data set is an image of a handwritten line paired with its manually produced text transcription. We apply sequence comparison algorithms to these transcriptions, exploring several algorithms that have previously been applied in phonology to assess their usefulness for retrieving handwritten material. Finding an appropriate method for comparing text lines will allow us to cluster the corresponding images of handwritten lines into training sets. These training sets can then be used for pattern recognition, an important part of the historical handwritten document retrieval system. First, we study the information needs of the users of an archive where the historical documents are stored. Then we explore the longest common substring (LCS), Levenshtein and Jaccard measures for matching text lines. Taking into account the drawbacks of these methods, we propose weighting the words in the text in proportion to their information content. This weighting is expected to provide results closer to the information needs of users. We evaluate the results in terms of precision for the top k retrieved text lines. Using mean precision curves, we show that the performance of sequence comparison increases by up to 18% when the weighted comparisons are used.
Database: OpenAIRE
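
The measures named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not give the weighting formula, so the sketch assumes a word's information content is the negative log of its relative frequency in the collection, and it assumes text lines are already split into word tokens. All function names are hypothetical, and the LCS measure is left out for brevity.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Word-level edit distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Unweighted Jaccard similarity of the word sets of two lines."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def word_weights(lines):
    """Assumed weighting: -log2 of each word's relative frequency."""
    counts = Counter(w for line in lines for w in line)
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}


def weighted_jaccard(a, b, weights):
    """Jaccard similarity where each word counts by its weight."""
    sa, sb = set(a), set(b)
    num = sum(weights.get(w, 0.0) for w in sa & sb)
    den = sum(weights.get(w, 0.0) for w in sa | sb)
    return num / den if den else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


# Toy transcriptions, invented for illustration only.
lines = [
    ["the", "king", "of", "holland"],
    ["the", "letter", "of", "the", "minister"],
    ["king", "willem", "of", "holland"],
]
weights = word_weights(lines)
print(levenshtein(lines[0], lines[2]))
print(jaccard(lines[0], lines[2]))
print(weighted_jaccard(lines[0], lines[2], weights))
```

Under this assumed weighting, frequent words such as "the" and "of" contribute less to the score than rare words such as "willem", which is the effect the description attributes to information-content weighting.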