Optical character recognition for South African languages
Authors: | Martin Puttkammer, Justin Hocking |
---|---|
Year of publication: | 2016 |
Subject: |
Engineering
business.industry; Feature extraction; Languages of Africa; Thesaurus; Optical character recognition; computer.software_genre; Optical character recognition software; ComputingMilieux_GENERAL; World Wide Web; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Tesseract; Artificial intelligence; business; computer; Natural language processing; Character recognition |
Source: | 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech). |
DOI: | 10.1109/robomech.2016.7813139 |
Description: | Optical Character Recognition (OCR) is an essential technology in the digitisation of printed media. Many OCR engines are language-specific and are available for widely used languages such as English and other European languages, but less so for smaller languages such as Tshivenda and the other South African languages. In this paper, we describe the process of training OCR engines for South African languages using Tesseract, and compare the accuracy of these engines with that of an English engine applied to the other South African languages. We find that our language-specific engines achieve high accuracy, with 50% fewer errors than an English engine. |
Database: | OpenAIRE |
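
The description compares engines by their error counts but does not say which metric was used. As a rough illustration only, the sketch below assumes a character-level error count based on Levenshtein edit distance between a reference transcription and the OCR output, and shows how a relative error reduction such as the reported 50% could be computed. The function names and the sample strings are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn hyp into ref (character level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Character errors divided by reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_error_reduction(baseline_errors: int, new_errors: int) -> float:
    """Fraction of the baseline's errors that the new engine removes."""
    if baseline_errors == 0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_errors - new_errors) / baseline_errors


# Hypothetical strings: \u1e71 is "t with circumflex below", a letter used
# in Tshivenda orthography that an English-only engine may misread as "t".
reference = "\u1e71a\u1e71a"
english_engine_output = "tata"    # both diacritic letters lost
specific_engine_output = "\u1e71ata"  # one diacritic letter lost

baseline = edit_distance(reference, english_engine_output)
improved = edit_distance(reference, specific_engine_output)
reduction = relative_error_reduction(baseline, improved)
```

Under these assumptions, a language-specific engine that makes half as many character errors as the English engine on the same text yields a relative error reduction of 0.5.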