A word extraction algorithm for machine-printed documents using a 3D neighborhood graph model
Author: | Hwan-Chul Park, Hwan-Gue Cho, Young-Jung Yu, Se-Young Ok |
---|---|
Year of publication: | 2001 |
Subject: | Connected component; Computer science; Context (language use); Document processing; Computer Science Applications; Character (mathematics); Minimum bounding box; Pattern recognition (psychology); Computer vision; Computer Vision and Pattern Recognition; Artificial intelligence; Line (text file); Software; Word (computer architecture); Natural language processing |
Source: | International Journal on Document Analysis and Recognition. 4:115-130 |
ISSN: | 1433-2833 |
DOI: | 10.1007/pl00010903 |
Description: | Automatic character recognition and image understanding of a given paper document are the main objectives of the computer vision field. For these problems, a basic step is to isolate characters and group words from these isolated characters. In this paper, we propose a new method for extracting characters from a mixed text/graphic machine-printed document and an algorithm for distinguishing words from the isolated characters. For extracting characters, we exploit several features (size, elongation, and density) of characters and propose a characteristic value for classification using the run-length frequency of the image component. In the context of word grouping, previous works have largely been concerned with words which are placed on a horizontal or vertical line. Our word grouping algorithm can group words which are on inclined lines, intersecting lines, and even curved lines. To do this, we introduce the 3D neighborhood graph model which is very useful and efficient for character classification and word grouping. In the 3D neighborhood graph model, each connected component of a text image segment is mapped onto 3D space according to the area of the bounding box and positional information from the document. We conducted tests with more than 20 English documents and more than ten oriental documents scanned from books, brochures, and magazines. Experimental results show that more than 95% of words are successfully extracted from general documents, even in very complicated oriental documents. |
Database: | OpenAIRE |
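
The description states only that each connected component is mapped onto 3D space according to its bounding-box area and its position in the document. The following Python sketch illustrates one way such a mapping could support word grouping. It is not the authors' algorithm: the choice of the box centre as position, the square root of the area as the third coordinate, the Euclidean distance threshold `max_dist`, and the function names `to_3d` and `group_words` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def to_3d(box, area_weight=1.0):
    """Map a bounding box (x0, y0, x1, y1) to a 3D point (cx, cy, z).

    Assumption: position is the box centre and z is area_weight * sqrt(area),
    so components of very different size end up far apart along z.
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    return ((x0 + x1) / 2, (y0 + y1) / 2, area_weight * math.sqrt(area))


def group_words(boxes, max_dist):
    """Group component indices whose 3D points are chained by short edges.

    An edge joins two components when their 3D points lie within max_dist
    (a hypothetical threshold). Groups are the connected parts of that graph.
    """
    points = [to_3d(b) for b in boxes]
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three characters, a gap, two more characters, and one large graphic.
boxes = [
    (0, 0, 10, 10), (12, 0, 22, 10), (24, 0, 34, 10),
    (60, 0, 70, 10), (72, 0, 82, 10),
    (30, 20, 130, 120),
]
words = group_words(boxes, max_dist=15)
```

Because the distance is measured between points rather than along a fixed axis, the grouping in this sketch does not depend on the text running horizontally or vertically, and the size coordinate keeps a large graphic component from being chained to nearby characters.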