Table of Contents Recognition in OCR Documents using Image-based Machine Learning
Author: | Nelson Zange Tsaku, Mingon Kang, Pritesh Patel, Tanju Bayramoglu, Sai Chandra Kosaraju, Girish Modgil |
---|---|
Year of publication: | 2019 |
Subject: |
Structure (mathematical logic); business.industry; Computer science; TheoryofComputation_GENERAL; Optical character recognition; Document analysis; computer.software_genre; Machine learning; Image (mathematics); ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Table of contents; Artificial intelligence; business; computer; Image based |
Source: | ACM Southeast Regional Conference |
Description: | The importance of automatic analysis of Optical Character Recognition (OCR) documents has been increasingly recognized as a way to support efficient data management and accessibility. However, most OCR documents are unstructured, which makes the analysis extremely challenging. A document's Table of Contents (TOC) provides the overall structure of the document, such as its chapters and appendixes. Hence, TOC recognition enables more effective analysis of OCR documents. Most existing related works are based on textual features, such as keywords and font sizes. However, text-based TOC recognition often fails when OCR documents are complex. In this study, we develop a novel image-based machine learning approach for TOC recognition, in which one-dimensional horizontal projections of pages are proposed as features for classifying TOC and non-TOC pages. To the best of our knowledge, this is the first work to recognize TOC by image-based analysis. We evaluated the proposed methods on PDF documents of theses and dissertations. The experimental results show that our proposed methods outperformed others. |
Database: | OpenAIRE |
External link: |
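
The description mentions one-dimensional horizontal projections of a page image as the basis for separating TOC from non-TOC pages. The record gives no implementation details, so the following is only a minimal sketch of what such a projection could look like, not the authors' method. The function names (`horizontal_projection`, `resample_profile`), the ink threshold of 128, the fixed feature length, and the synthetic page are all assumptions made for illustration; the classifier itself is not shown.

```python
import numpy as np


def horizontal_projection(page, ink_threshold=128):
    """Count dark ("ink") pixels in each row of a grayscale page.

    `page` is a 2-D array with 0 = black and 255 = white.
    The threshold value is an assumption, not taken from the paper.
    """
    ink = np.asarray(page) < ink_threshold
    return ink.sum(axis=1)


def resample_profile(profile, length=64):
    """Resample a projection to a fixed length and scale it to [0, 1].

    A fixed-length vector is one plausible way to feed pages of
    different heights to a classifier (hypothetical design choice).
    """
    profile = np.asarray(profile, dtype=float)
    positions = np.linspace(0, len(profile) - 1, num=length)
    resampled = np.interp(positions, np.arange(len(profile)), profile)
    peak = resampled.max()
    return resampled / peak if peak > 0 else resampled


# Synthetic 100x80 white page with two dark "text lines" of different widths.
page = np.full((100, 80), 255, dtype=np.uint8)
page[10:14, 5:75] = 0   # long line: 70 ink pixels per row
page[30:34, 5:40] = 0   # short line: 35 ink pixels per row

profile = horizontal_projection(page)
features = resample_profile(profile, length=50)
```

In this sketch `features` is a fixed-length vector that could be passed to any standard classifier; which model and which preprocessing the paper actually used is not stated in this record.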