Convolutional Neural Networks for Figure Extraction in Historical Technical Documents
Authors: | Chun-Nam Yu, Iraj Saniee, Caleb Levy |
---|---|
Year of publication: | 2017 |
Subject: |
Information retrieval; Training set; Computer science; 02 engineering and technology; 010501 environmental sciences; Technical documentation; 01 natural sciences; Convolutional neural network; Task (project management); Data set; Test set; 0202 electrical engineering, electronic engineering, information engineering; Key (cryptography); 020201 artificial intelligence & image processing; Precision and recall; 0105 earth and related environmental sciences |
Source: | ICDAR |
DOI: | 10.1109/icdar.2017.134 |
Description: | We present a method of extracting figures and images from the pages of scanned documents, especially from technical research articles. Our approach is novel in two key ways. First, we treat this as a computer vision problem, and train convolutional neural networks to recognize figures in scanned pages. Second, we generate our training data from 'born-digital' structured documents, allowing us to automatically produce labels for our training set using PDF figure extractors. This avoids the otherwise tedious task of hand-labelling thousands of document pages. Our convolutional neural networks achieve precision and recall of close to 85% in identifying figures from a test set consisting of modern journal papers and conference proceedings, and obtain precision and recall above 80% on an application data set composed of historical technical documents scanned from the Bell Labs Records. Our results show that models trained on digital documents transfer very well to historical scans. Finally, it is easy to extend our models to identify other document elements such as tables and captions. |
Database: | OpenAIRE |
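
The abstract reports precision and recall for figure identification but does not state how a detected figure is matched to a labelled one. As a minimal sketch only, the following assumes figures are axis-aligned bounding boxes and that a detection counts as correct when its intersection-over-union (IoU) with an unmatched labelled box is at least 0.5. The threshold, the greedy matching, and the names `iou` and `precision_recall` are illustrative assumptions, not the authors' evaluation protocol.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max) in page coordinates (assumed layout).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predicted: List[Box], truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each predicted box to its best unmatched labelled box.

    The 0.5 threshold is a hypothetical default, not taken from the paper.
    """
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)  # each labelled figure is matched at most once
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    detections = [(0, 0, 10, 10), (50, 50, 60, 60)]
    labels = [(0, 0, 10, 10), (20, 20, 30, 30), (100, 100, 110, 110)]
    print(precision_recall(detections, labels))
```

In this sketch, labels produced automatically by a PDF figure extractor and boxes predicted by a network would both be passed in as plain lists of boxes, so the same function could score the modern test set and the historical scans alike.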