Optical character recognition for South African languages

Authors: Martin Puttkammer, Justin Hocking
Year of publication: 2016
Subject:
Source: 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech).
DOI: 10.1109/robomech.2016.7813139
Description: Optical Character Recognition (OCR) is an essential technology in the digitisation of printed media. Many OCR engines are language-specific and are available for common languages such as English and other European languages, but less so for smaller languages such as Tshivenda and the other South African languages. In this paper, we describe the process of training OCR engines for South African languages using Tesseract, and compare the accuracy of these engines with that of an English engine applied to the other South African languages. We find that our language-specific engines achieve high accuracy, with 50% fewer errors than when using an English engine.
Database: OpenAIRE