Investigation and modeling of the structure of texting language

Authors:	Monojit Choudhury, Vijit Jain, Sudeshna Sarkar, Rahul Saraf, Anupam Basu, Animesh Mukherjee
Publication year:	2007
Subject:	Computer science; Bigram; Speech recognition; Electronic mail; Computer Science Applications; Pattern recognition (psychology); Standard English; Text normalization; Computer Vision and Pattern Recognition; Artificial intelligence; Language model; Hidden Markov model; Software; Natural language processing; Word (computer architecture)
Source:	International Journal of Document Analysis and Recognition (IJDAR). 10:157-174
ISSN:	1433-2825 1433-2833
DOI:	10.1007/s10032-007-0054-0
Description:	Language usage in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language and is referred to as texting language (TL). The presence of intentional misspellings significantly decreases the accuracy of existing spell-checking techniques for TL words. In this work, we formally investigate the nature and types of compression used in SMS texts, and develop a Hidden Markov Model-based word model for TL. The model parameters are estimated through standard machine learning techniques from a word-aligned parallel corpus of SMS and standard English text. The accuracy of the model in correcting TL words is 57.7%, which is almost a threefold improvement over the performance of Aspell. The use of a simple bigram language model results in a 35% relative reduction in word-level error rates.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::d233164bce4fb60f05c782542f43306b https://doi.org/10.1007/s10032-007-0054-0