DDSpell - A Data Driven Spell Checker and Suggestion Generator for the Tamil Language

Authors: Keerthana Uthayamoorthy, K. Sarveswaran, Gihan Dias, Thavarasa Senthaalan, Kirshika Kanthasamy
Publication year: 2019
Subject:
Source: 2019 19th International Conference on Advances in ICT for Emerging Regions (ICTer).
DOI: 10.1109/icter48817.2019.9023698
Description: DDSpell is a spell checker and suggestion generator for the Tamil language, developed using a data-driven and language-independent approach. It can be used to check non-word errors and canti errors. Tamil does not have a full-fledged spell checker for common use; this is the case for many South Asian languages. Although several attempts have been made to develop spell checkers and correctors for Tamil, few successes have been reported; more importantly, most of them exist only on paper and are not available for the public to download or use. This paper outlines an approach by which a spell checker can be easily developed for Tamil and for languages that share a similar word-level structure. A dictionary of 4.0M Tamil words, created from various sources, is used to check spelling. Character-level bigram similarity matching, minimum edit distance measures, and word frequencies are used to make suggestions for misspelled words. In addition, techniques such as hash keys and hash tables are used to improve the processing speed of spell checking and suggestion generation. Since no benchmark datasets were available to evaluate Tamil spell checkers, a benchmark dataset was also compiled systematically as part of this research. DDSpell and the existing Tamil spell checkers were evaluated using this dataset, and DDSpell outperformed the other tools with an accuracy of 98.4%. DDSpell was extended to the Sinhala language to show that it can be applied to similar languages. A dictionary of 1.1M unique Sinhala words was created from various corpora and stored for this purpose. The Sinhala spell checker showed an accuracy of 98% on a random dataset. A link to DDSpell is made available via GitHub for the general public to use and test. Further, DDSpell is being extended to add more languages and features such as API access.
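The pipeline named in the abstract (hash-table dictionary lookup, character-level bigram similarity, minimum edit distance, and word frequency) can be illustrated with a minimal sketch. This is not the DDSpell implementation: the class and function names, the Dice coefficient as the similarity measure, the boundary padding, the bigram index, the ranking order, and the toy Latin-script word list are all assumptions made for illustration. A real Tamil checker would also need to decide whether bigrams are formed over code points or over full syllabic units.

```python
from collections import defaultdict


def bigrams(word):
    """Return the set of character bigrams of a word, padded with boundary marks."""
    padded = f"^{word}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice coefficient over character bigram sets (assumed similarity measure)."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def edit_distance(a, b):
    """Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


class SpellChecker:
    """Hypothetical checker: a dict for lookup plus a bigram-to-words index."""

    def __init__(self, word_freq):
        self.freq = dict(word_freq)    # hash table: word -> frequency
        self.index = defaultdict(set)  # hash table: bigram -> words containing it
        for word in self.freq:
            for bg in bigrams(word):
                self.index[bg].add(word)

    def is_correct(self, word):
        """Non-word error check: constant-time dictionary membership."""
        return word in self.freq

    def suggest(self, word, max_distance=2, limit=5):
        """Rank candidates by edit distance, then bigram similarity, then frequency."""
        candidates = set()
        for bg in bigrams(word):
            candidates |= self.index.get(bg, set())
        scored = []
        for cand in candidates:
            dist = edit_distance(word, cand)
            if dist <= max_distance:
                scored.append((dist, -dice(word, cand), -self.freq[cand], cand))
        scored.sort()
        return [cand for *_, cand in scored[:limit]]


# Toy word list with made-up frequencies, for illustration only.
checker = SpellChecker({"spell": 50, "spill": 10, "shell": 30, "checker": 20})
print(checker.is_correct("spel"))
print(checker.suggest("spel"))
```

The bigram index restricts edit-distance computation to words that share at least one bigram with the input, which is one plausible way hashing could keep suggestion generation fast over a multi-million-word dictionary.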
Database: OpenAIRE