Semi-supervised learning for detecting text-lines in noisy document images

Author:	Hanning Zhou, Zongyi Liu
Year of publication:	2010
Subject:	business.industry; Computer science; Supervised learning; Pattern recognition; Semi-supervised learning; Mathematical morphology; Speckle pattern; Segmentation; Computer vision; Artificial intelligence; Document retrieval; business; Classifier (UML); Document layout analysis; Digitization
Source:	DRR
ISSN:	0277-786X
Description:	Document layout analysis is a key step in document image understanding, with wide applications in document digitization and reformatting. Identifying the correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines in noisy document images. Our framework consists of three steps. The first step is an initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separators, and vertical border noise; it efficiently removes vertical border noise from multi-column pages. The third step is an online classifier that is trained on the high-confidence line detection results from Step Two and filters out noise from the low-confidence lines. The classifier effectively removes speckle noise embedded inside the content zones. We compare the performance of our algorithm with state-of-the-art work in the field on the UW-III database. As baselines, we choose the results reported by the Image Understanding Pattern Recognition Research (IUPR) group and Scansoft Omnipage SDK 15.5. We evaluate performance at both the page-frame level and the text-line level. The results show that our system has a much lower false-alarm rate while maintaining a similar content detection rate. In addition, we show that our online training model generalizes better than algorithms that depend on offline training.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::58af3ef105905faa5cd61c2507fe3455 https://doi.org/10.1117/12.837362
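
Illustrative sketch (not part of the record): the third step in the description, a classifier trained on the page's own high-confidence lines and used to filter its low-confidence ones, could look roughly like the following. The record does not say which classifier, features, or thresholds the authors use, so everything here is an assumption: the `Candidate` fields, the `filter_candidates` function, the `high` and `max_z` parameters, and the simple per-feature z-score test that stands in for the paper's online classifier.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Candidate:
    """A candidate text-line region (all fields are hypothetical)."""
    height: float      # bounding-box height in pixels
    density: float     # fraction of dark pixels inside the box
    confidence: float  # assumed score from the grouping step, in [0, 1]


def _center_and_spread(values):
    """Return (mean, spread), flooring the spread so that a very uniform
    page does not reject every slightly different line."""
    mu = mean(values)
    return mu, max(pstdev(values), 0.05 * abs(mu), 1e-6)


def filter_candidates(candidates, high=0.8, max_z=3.0):
    """Keep high-confidence candidates, plus low-confidence candidates
    that resemble the high-confidence lines of the same page."""
    trusted = [c for c in candidates if c.confidence >= high]
    if len(trusted) < 2:
        # Too little evidence to fit a page-specific model.
        return trusted

    h_mu, h_sd = _center_and_spread([c.height for c in trusted])
    d_mu, d_sd = _center_and_spread([c.density for c in trusted])

    kept = list(trusted)
    for c in candidates:
        if c.confidence >= high:
            continue  # already kept as a trusted line
        # Largest deviation from the trusted lines, in units of spread.
        z = max(abs(c.height - h_mu) / h_sd, abs(c.density - d_mu) / d_sd)
        if z <= max_z:
            kept.append(c)
    return kept
```

Because the model is fitted per page from that page's own confident detections, no offline training set is needed; that is the property the description contrasts with offline-trained approaches, although the real system may achieve it quite differently.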