Classification of crowdsourced text correction
Author: | Haimonti Dutta, Megha Gupta, Brian Geiger |
---|---|
Year of publication: | 2015 |
Source: | CODS |
DOI: | 10.1145/2732587.2732619 |
Description: | Optical Character Recognition (OCR) is a commonly used technique for digitizing printed material, enabling it to be displayed online, searched, and used in text mining applications. The text generated by OCR devices is often garbled due to variations in the quality of the input paper, the size and style of the font, and the column layout. This adversely affects retrieval effectiveness, and hence techniques for cleaning the garbled text need to be improved. This prototype system is expected to be deployed on historical newspaper archives that make extensive use of user text corrections. |
Database: | OpenAIRE |