Convolutional Neural Networks for Figure Extraction in Historical Technical Documents

Authors: Chun-Nam Yu, Iraj Saniee, Caleb Levy
Year of publication: 2017
Subject:
Source: ICDAR
DOI: 10.1109/icdar.2017.134
Description: We present a method of extracting figures and images from the pages of scanned documents, especially from technical research articles. Our approach is novel in two key ways. First, we treat this as a computer vision problem, and train convolutional neural networks to recognize figures in scanned pages. Second, we generate our training data from 'born-digital' structured documents, allowing us to automatically produce labels for our training set using PDF figure extractors. This avoids the otherwise tedious task of hand-labelling thousands of document pages. Our convolutional neural networks achieve precision and recall of close to 85% in identifying figures from a test set consisting of modern journal papers and conference proceedings, and obtain precision and recall above 80% on an application data set composed of historical technical documents scanned from the Bell Labs Records. Our results show that models trained on digital documents transfer very well to historical scans. Finally, it is easy to extend our models to identify other document elements such as tables and captions.
Database: OpenAIRE
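
Note: The description above reports precision and recall for figure detection but does not say how a predicted figure region is matched to a labelled one. The following is a minimal sketch of one common convention, assuming axis-aligned bounding boxes and a match whenever intersection-over-union (IoU) reaches 0.5. The names `iou` and `precision_recall`, the box layout, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in page pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each predicted box to at most one labelled box.

    The threshold is an assumed convention, not a value from the paper.
    """
    unmatched = list(truth)
    true_positives = 0
    for p in predicted:
        # Pick the still-unmatched labelled box that overlaps p the most.
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical page: one detection overlaps a labelled figure, one does not.
predicted_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
labelled_boxes = [(1, 1, 10, 10), (100, 100, 120, 120)]
precision, recall = precision_recall(predicted_boxes, labelled_boxes)
```

Under this convention, labels produced automatically by a PDF figure extractor on born-digital pages would play the role of `labelled_boxes`, and the network's detections on the corresponding page images would play the role of `predicted_boxes`.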