Abugida Normalizer and Parser for Unicode texts

Authors: Ansary, Nazmuddoha; Adib, Quazi Adibur Rahman; Reasat, Tahsin; Mehnaz, Sazia; Sushmit, Asif Shahriyar; Humayun, Ahmed Imtiaz; Rashid, Mohammad Mamun Or; Sadeque, Farig
Language: English
Publication year: 2023
Subject:
Description: This paper proposes two libraries to address common and uncommon issues with Unicode-based writing schemes for Indic languages. The first is a normalizer that corrects inconsistencies caused by the encoding scheme (https://pypi.org/project/bnunicodenormalizer/). The second is a grapheme parser for Abugida text (https://pypi.org/project/indicparser/). Both tools are more efficient and effective than previously used tools. We report a 400% increase in speed and significantly better performance on different language-model-based downstream tasks.
3 pages, 1 figure
Database: OpenAIRE