Categorizing Document Images into Script and Language Classes

Authors: Sabine Bergler, A. Bloch, Nicola Nobile, C. P. Nadal, B. Waked, Ching Y. Suen
Publication year: 1999
Subject:
Source: International Conference on Advances in Pattern Recognition ISBN: 9781447112143
DOI: 10.1007/978-1-4471-0833-7_30
Description: In order to properly archive and index large numbers of international documents, several challenging processing steps must be completed even before optical character recognition (OCR) can be applied. We present a system that preclassifies documents for further processing and OCR. The system operates in four phases: preprocessing (including skew detection, segmentation, and noise removal), script (Latin, Arabic, Ideographic, or Cyrillic) classification, shape coding, and language classification for seven European languages.
Database: OpenAIRE