Description: |
To archive and index large numbers of international documents properly, several challenging processing steps must be completed before optical character recognition (OCR) can be applied. We present a system that preclassifies documents for further processing and OCR. The system operates in four phases: preprocessing (including skew detection, segmentation, and noise removal), script classification (Latin, Arabic, Ideographic, or Cyrillic), shape coding, and language classification for seven European languages. |