Popis: |
Document layout analysis is a key step in document image understanding with wide applications in document digitization and reformatting. Identifying correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines from noisy document images. Our framework consists of three steps. The first step is the initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separator and vertical border noise. It is able to efficiently remove the vertical border noises from multi-column pages. The third step is an online classifier that is trained with the high confidence line detection results from Step Two, and filters out noise from low confidence lines. The classifier effectively removes speckle noises embedded inside the content zones. We compare the performance of our algorithm to the state-of-the-art work in the field on the UW-III database. We choose the results reported by the Image Understanding Pattern Recognition Research (IUPR) and Scansoft Omnipage SDK 15.5. We evaluate the performances at both the page frame level and the text-line level. The result shows that our system has much lower false-alarm rate, while maintains similar content detection rate. In addition, we also show that our online training model generalizes better than algorithms depending on offline training. |