Automatic metadata extraction via image processing using Migne's Patrologia Graeca

Authors: Varthis, Evagelos; Poulos, Marios; Giarenis, Ilias; Papavlasopoulos, Sozon
Source: International Journal of Metadata, Semantics and Ontologies; 2020, Vol. 14, Issue 4, pp. 265-278, 14p
Abstract: Libraries and cultural institutions hold a wealth of knowledge in various digital forms, yet much of it cannot be searched by simple terms, let alone by meaning. This study proposes a novel approach that recognises words and automatically generates metadata from large machine-printed corpora such as Migne's Patrologia Graeca (PG). The proposed framework first applies an efficient word segmentation and then transforms the word-images into special compact shapes. For comparison, it uses Hu's invariant moments to discard unlikely matches, Shape Context (SC) to measure contour similarity, and Pearson's Correlation Coefficient (PCC) for final verification. Comparative results are presented in which the Long Short-Term Memory (LSTM) neural network (NN) engine of the Tesseract Optical Character Recognition (OCR) system replaces PCC. In addition, an intelligent scenario is proposed for the automatic generation of PG metadata by librarians.
Database: Supplemental Index
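
Illustrative note: the matching cascade described in the abstract (a cheap moment-based filter followed by a correlation check) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses only the first two of Hu's seven invariant moments, omits the Shape Context stage entirely, assumes both word-images are binary grids already normalised to the same size, and the function names and thresholds (`hu_tol`, `pcc_min`) are hypothetical.

```python
import math


def hu_first_two(img):
    """First two Hu invariant moments of a 2D grid of pixel intensities."""
    pixels = [(x, y, v) for y, row in enumerate(img) for x, v in enumerate(row)]
    m00 = sum(v for _, _, v in pixels)
    if m00 == 0:
        raise ValueError("image has no foreground pixels")
    # Centroid of the shape.
    cx = sum(x * v for x, _, v in pixels) / m00
    cy = sum(y * v for _, y, v in pixels) / m00

    def eta(p, q):
        # Normalised central moment: invariant to translation and scale.
        mu = sum(((x - cx) ** p) * ((y - cy) ** q) * v for x, y, v in pixels)
        return mu / (m00 ** (1 + (p + q) / 2))

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2


def pearson(a, b):
    """Pearson's correlation coefficient between two equally sized images."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same number of pixels")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant image carries no correlation information
    return cov / (sx * sy)


def is_match(query, candidate, hu_tol=0.05, pcc_min=0.9):
    """Two-stage check: moment pre-filter, then correlation verification.

    hu_tol and pcc_min are hypothetical thresholds chosen for illustration.
    """
    hq, hc = hu_first_two(query), hu_first_two(candidate)
    if any(abs(a - b) > hu_tol for a, b in zip(hq, hc)):
        return False  # discarded early as an unlikely match
    return pearson(query, candidate) >= pcc_min
```

The ordering reflects the design idea stated in the abstract: the moment comparison is cheap and discards most candidates, so the more expensive pixel-level verification runs only on the few that remain.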