Training a whole-book LSTM-based recognizer with an optimal training set

Authors:	Didier Stricker, Ehsanollah Kabir, Mohammad Reza Soheili, Mohammad Reza Yousefi
Year of publication:	2018
Subject:	Long short term memory; Training set; Computer science; Font; Redundancy (engineering); Artificial intelligence; Cluster analysis; Machine learning
Source:	ICMV
DOI:	10.1117/12.2309615
Description:	Despite recent progress in OCR technologies, whole-book recognition is still a challenging task, particularly for old and historical books, where unknown font faces and the low quality of paper and print add to the difficulty. Pre-trained recognizers and generic methods therefore do not usually perform up to the required standards, and their performance usually degrades on larger-scale recognition tasks such as a whole book. Methods reported to have low error rates turn out to require a great deal of manual correction. In general, such methodologies do not make effective use of concepts such as redundancy in whole-book recognition. In this work, we propose to train Long Short Term Memory (LSTM) networks on a minimal training set obtained from the book to be recognized. We show that by clustering all the sub-words in the book and using the sub-word cluster centers as the training set for the LSTM network, we can train models that outperform any identical network trained on randomly selected pages of the book. In our experiments, we also show that although the sub-word cluster centers are equivalent to about 8 pages of text for a 101-page book, an LSTM network trained on such a set performs competitively with an identical network trained on a set of 60 randomly selected pages of the book.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::24c8bc023e3b5cf0cd6c0c65c102440b https://doi.org/10.1117/12.2309615
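
The description above proposes clustering all sub-words in a book and training on the cluster centers only. The record does not say which clustering algorithm or which sub-word features the authors use, so the following is only a minimal sketch under stated assumptions: each sub-word image is assumed to be already reduced to a fixed-length feature vector, plain k-means stands in for the clustering step, and the real sample nearest to each centroid is taken as the cluster's representative. The function name `select_cluster_representatives` and all parameters are hypothetical.

```python
import numpy as np


def select_cluster_representatives(features, n_clusters, n_iter=20, seed=0):
    """Return sorted indices of the samples closest to each k-means centroid.

    Assumption: `features` is an (n_samples, n_dims) array of sub-word
    descriptors; k-means is an illustrative choice, not the paper's method.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    # Initialise centroids from distinct randomly chosen samples.
    start = rng.choice(len(features), size=n_clusters, replace=False)
    centroids = features[start]
    for _ in range(n_iter):
        # Squared distance from every sample to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = features[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # One representative per centroid: the real sample nearest to it.
    return sorted(set(dists.argmin(axis=0).tolist()))


# Toy data: three tight groups standing in for repeated sub-word shapes.
demo_rng = np.random.default_rng(1)
demo_features = np.vstack(
    [demo_rng.normal(center, 0.1, size=(30, 2)) for center in (0.0, 5.0, 10.0)]
)
reps = select_cluster_representatives(demo_features, n_clusters=3)
print(f"{len(reps)} representatives chosen from {len(demo_features)} samples")
```

Only the selected representatives would then need manual transcription before being used as LSTM training data, which is where the reported saving (about 8 pages' worth of text for a 101-page book) would come from.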