Abugida Normalizer and Parser for Unicode texts
Author: | Ansary, Nazmuddoha; Adib, Quazi Adibur Rahman; Reasat, Tahsin; Mehnaz, Sazia; Sushmit, Asif Shahriyar; Humayun, Ahmed Imtiaz; Rashid, Mohammad Mamun Or; Sadeque, Farig |
---|---|
Language: | English |
Publication year: | 2023 |
Subject: | |
Description: | This paper proposes two libraries that address common and uncommon issues with Unicode-based writing schemes for Indic languages. The first is a normalizer that corrects inconsistencies caused by the encoding scheme (https://pypi.org/project/bnunicodenormalizer/). The second is a grapheme parser for Abugida text (https://pypi.org/project/indicparser/). Both tools are more efficient and effective than previously used tools. We report a 400% increase in speed and significantly better performance on various language-model-based downstream tasks. 3 pages, 1 figure. |
Database: | OpenAIRE |
External link: |