A Fast Japanese Word Extraction with Classification to Similarly-Shaped Character Categories and Morphological Analysis

Authors: Masaharu Ozaki, Katsuhiko Itoniri
Year of publication: 1999
Subject:
Source: Document Analysis Systems: Theory and Practice ISBN: 9783540665076
Document Analysis Systems
DOI: 10.1007/3-540-48172-9_17
Description: A fast technique for extracting words from Japanese document images is described. It classifies each character image not into individual characters but into categories of similarly shaped characters. Morphological analysis is performed on the sequence of categories to obtain word candidates. Detailed classification is performed on character images that cannot be identified as single characters. A multi-template methodology and hierarchical classification are combined to make the classifier accurate and fast with low-dimensional vectors. In experiments on the learning samples, the classification accuracy was 99.3% and the speed was eight times faster than traditional Japanese OCRs. In experiments on test samples made from forty newspaper articles, the classification speed was still eight times faster. The morphological analysis greatly reduced the number of character candidates: 85% of the characters in the newspaper article images were identified as single characters.
Database: OpenAIRE
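
The idea in the description, that a dictionary lookup over a sequence of coarse categories can often pin down the exact characters, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the categories, the three-word lexicon, and the names `CATEGORIES`, `LEXICON`, `category_key`, `build_index`, and `resolve` are all hypothetical, and a whole-word lookup stands in for real morphological analysis, which would also segment a longer sequence into words.

```python
from collections import defaultdict

# Hypothetical toy categories of similarly shaped characters (assumption,
# not taken from the paper).
CATEGORIES = {
    "C1": "未末",
    "C2": "来",
    "C3": "人入",
    "C4": "口",
}

# Hypothetical toy lexicon standing in for a morphological dictionary.
LEXICON = ["未来", "人口", "入口"]

# Reverse map: character -> category label.
CHAR_TO_CATEGORY = {ch: cat for cat, chars in CATEGORIES.items() for ch in chars}


def category_key(word):
    """Convert a word into the tuple of category labels of its characters."""
    return tuple(CHAR_TO_CATEGORY[ch] for ch in word)


def build_index(lexicon):
    """Index lexicon words by their category sequence."""
    index = defaultdict(list)
    for word in lexicon:
        index[category_key(word)].append(word)
    return index


def resolve(category_sequence, index):
    """Return word candidates and, per position, the characters they allow.

    A position with exactly one character is identified without detailed
    classification; a position with several still needs it.
    """
    candidates = index.get(tuple(category_sequence), [])
    per_position = [
        sorted({word[i] for word in candidates})
        for i in range(len(category_sequence))
    ]
    return candidates, per_position


if __name__ == "__main__":
    index = build_index(LEXICON)
    print(resolve(["C1", "C2"], index))  # one candidate word, both positions fixed
    print(resolve(["C3", "C4"], index))  # two candidate words, first position ambiguous
```

In this toy setup the sequence `C1, C2` matches only 未来, so both characters are identified from the lexicon alone, while `C3, C4` matches both 人口 and 入口, so the first character image would still need detailed classification.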