Normalization of Non-Standard Words in Croatian Texts

Author:	Beliga, Slobodan; Pobar, Miran; Martinčić-Ipšić, Sanda
Year of publication:	2015
Subject:	Computer Science - Computation and Language
Document type:	Working Paper
Description:	This paper presents text normalization, an integral part of any text-to-speech synthesis system. Text normalization is a set of methods for writing non-standard words, such as numbers, dates, times, abbreviations, acronyms, and the most common symbols, in their full expanded form. A complete taxonomy for classifying non-standard words in Croatian is proposed, together with rule-based normalization methods combined with a lookup dictionary. The achieved token rate for normalization of Croatian texts is 95%, and 80% of the expanded words are in the correct morphological form. Comment: 8 pages, 3 figures. In: Text, Speech and Dialogue, extension to Lecture Notes in Artificial Intelligence LNAI 6836. Hebernal, Ivan; Matoušek, Václav (eds.). Plzen: University of West Bohemia, 2011, pp. 1-8 (ISBN: 987-80-261-0069-0)
Database:	arXiv
External link:	http://arxiv.org/abs/1503.08167