DDSpell - A Data Driven Spell Checker and Suggestion Generator for the Tamil Language
Authors: | Keerthana Uthayamoorthy, K. Sarveswaran, Gihan Dias, Thavarasa Senthaalan, Kirshika Kanthasamy |
---|---|
Year of publication: | 2019 |
Subject: |
Generator (computer programming); Computer science; Hash function; Spell; Spelling; Hash table; Word lists by frequency; Tamil; Edit distance; Artificial intelligence; Natural language processing; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing |
Source: | 2019 19th International Conference on Advances in ICT for Emerging Regions (ICTer). |
DOI: | 10.1109/icter48817.2019.9023698 |
Description: | DDSpell is a spell checker and suggestion generator for the Tamil language, developed using a data-driven and language-independent approach. It can be used to check non-word errors and canti errors. Tamil does not have a full-fledged spell checker for common usage; this is the case for many South Asian languages. Though several attempts have been made to develop spell checkers and correctors for Tamil, few successes have been reported; more importantly, most of them exist only on paper and are not available for the public to download or use. This paper outlines an approach by which a spell checker can be easily developed for Tamil-like languages that share a similar word-level structure. A dictionary of 4.0M Tamil words, created from various sources, is used to check spelling. Character-level bi-gram similarity matching, minimum edit distance measures, and word frequencies were used to make suggestions for misspelled words. In addition, techniques such as hash keys and hash tables were used to improve the processing speed of spell checking and suggestion generation. Since no benchmark datasets were available to evaluate Tamil spell checkers, a benchmark dataset was also compiled systematically as part of this research. DDSpell and the existing Tamil spell checkers were evaluated using this dataset, and DDSpell was found to outperform the other tools, achieving an accuracy of 98.4%. DDSpell was extended to the Sinhala language to show that the approach can be applied to similar languages. A dictionary of 1.1M unique Sinhala words was created from various corpora and stored for this purpose. The Sinhala spell checker showed an accuracy of 98% on a random dataset. A link to DDSpell is made available via GitHub for the general public to use and test. Further, DDSpell is being extended to add more languages and features such as API access. |
Database: | OpenAIRE |
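The pipeline named in the description (hash-based dictionary lookup, character-level bi-gram similarity, minimum edit distance, and word frequencies) can be illustrated with a minimal sketch. This is not the DDSpell implementation: the toy English lexicon, the frequencies, the function names, the use of the Dice coefficient as the bi-gram similarity measure, the similarity threshold, and the ranking order are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> corpus frequency (not data from the paper).
LEXICON = {"spell": 50, "spelling": 30, "shell": 20, "spill": 10, "smell": 5}


def bigrams(word):
    """Return the set of character bi-grams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient over bi-gram sets (an assumed similarity measure)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


# Hash-table index from each bi-gram to the lexicon words that contain it,
# so candidate retrieval does not scan the whole dictionary.
BIGRAM_INDEX = defaultdict(set)
for w in LEXICON:
    for bg in bigrams(w):
        BIGRAM_INDEX[bg].add(w)


def is_correct(word):
    """Dictionary membership test; a dict lookup is a hash-table probe."""
    return word in LEXICON


def suggest(word, max_suggestions=3, min_similarity=0.25):
    """Rank candidates by edit distance, then similarity, then frequency."""
    if is_correct(word):
        return []
    candidates = set()
    for bg in bigrams(word):
        candidates |= BIGRAM_INDEX.get(bg, set())
    kept = [w for w in candidates if dice(word, w) >= min_similarity]
    kept.sort(key=lambda w: (edit_distance(word, w), -dice(word, w), -LEXICON[w]))
    return kept[:max_suggestions]
```

With this toy lexicon, `suggest("spel")` ranks "spell" first because it is one edit away, and the remaining candidates, which tie on edit distance and bi-gram similarity, are ordered by frequency. For Tamil or Sinhala text, the bi-gram unit would likely need to be a grapheme cluster rather than a single code point; that choice is left open here.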