Text extraction in complex color documents

Author:	Antonios Atsalakis, Nikos Papamarkos, C. Strouthopoulos
Year of publication:	2002
Subject:	Color histogram; Color normalization; Computer science; business.industry; Color image; Binary image; ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION; Pattern recognition; HSL and HSV; Color quantization; Web colors; Artificial Intelligence; Signal Processing; Color depth; High color; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Computer vision; Computer Vision and Pattern Recognition; Artificial intelligence; business; Software
Source:	Pattern Recognition. 35:1743-1758
ISSN:	0031-3203
DOI:	10.1016/s0031-3203(01)00167-4
Description:	Text extraction in mixed-type documents is a necessary pre-processing stage for many document applications. In mixed-type color documents, text, drawings and graphics appear in millions of different colors. In many cases, text regions are overlaid onto drawings or graphics. In this paper, a new method to automatically detect and extract text in mixed-type color documents is presented. The proposed method is based on a combination of an adaptive color reduction (ACR) technique and a page layout analysis (PLA) approach. The ACR technique is used to obtain the optimal number of colors and to convert the document to these principal colors. Then, using the principal colors, the document image is split into separable color planes. Thus, binary images are obtained, each one corresponding to a principal color. The PLA technique is applied independently to each of the color planes and identifies the text regions. A merging procedure is applied in the final stage to merge the text regions derived from the color planes and to produce the final document. Several experimental and comparative results, exhibiting the performance of the proposed technique, are also presented.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::8d850ce8be7fb5de2951299b1e8f3613 https://doi.org/10.1016/s0031-3203(01)00167-4
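
The abstract in this record outlines a pipeline: color reduction, splitting into binary color planes, layout analysis on each plane, then merging. The following is a minimal sketch of the first two steps only, not the authors' implementation. It assumes plain k-means as a stand-in for the ACR technique (which, according to the abstract, also determines the number of colors itself, whereas here `k` is supplied by the caller). The function names `reduce_colors` and `split_into_planes` are hypothetical, and the PLA and merging stages are not shown.

```python
def reduce_colors(image, k, iters=10):
    """Stand-in for ACR: plain k-means over RGB tuples with a fixed k.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Returns (label_rows, centers), where label_rows has the image's shape.
    """
    pixels = [px for row in image for px in row]
    unique = sorted(set(pixels))
    # Deterministic initialisation: k evenly spaced entries of the sorted colors.
    centers = [unique[round(i * (len(unique) - 1) / max(k - 1, 1))]
               for i in range(k)]

    def nearest(px):
        # Index of the center with the smallest squared RGB distance.
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(px, centers[j])))

    for _ in range(iters):
        labels = [nearest(px) for px in pixels]
        for j in range(k):
            members = [px for px, lab in zip(pixels, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))

    labels = [nearest(px) for px in pixels]
    width = len(image[0])
    label_rows = [labels[r * width:(r + 1) * width] for r in range(len(image))]
    return label_rows, centers


def split_into_planes(label_rows, k):
    """One binary image (lists of 0/1) per principal color."""
    return [[[1 if lab == j else 0 for lab in row] for row in label_rows]
            for j in range(k)]


# Toy page: white background with a small red block standing in for text.
page = [[(255, 255, 255)] * 6 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2, 3):
        page[r][c] = (200, 0, 0)

label_rows, centers = reduce_colors(page, k=2)
planes = split_into_planes(label_rows, k=2)
```

Each entry of `planes` is a binary image for one principal color. In the method the abstract describes, a layout-analysis step would then run on each such plane independently before the detected text regions are merged.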