Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words

Authors:	Bernard Masua, Noel Masasi
Language:	English
Publication year:	2020
Subject:	Natural language processing; Text pre-processing; Swahili language; Stop-words; Slangs; Typos; Computer applications to medicine. Medical informatics; R858-859.7; Science (General); Q1-390
Source:	Data in Brief, Vol 33, Iss , Pp 106517- (2020)
Document type:	article
ISSN:	2352-3409
DOI:	10.1016/j.dib.2020.106517
Description:	Natural language processing (NLP) requires data to be pre-processed to ensure quality models across machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang, and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that young people use to convey their opinions on things that matter to them. We derive the list of common Swahili stop-words by reviewing the most frequent words generated from our corpus with a Python script, review common slang and its corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated from the corpus by a Python script. The datasets were exported into files for easy access and reuse. They can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/1d62aa94fd8c4f49a5b77b1a5c40f9d0
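
The description mentions Python scripts that rank corpus words by frequency, with the most frequent words reviewed as stop-word candidates and the least frequent as typo candidates. The record does not include those scripts, so the following is only a minimal sketch of that idea under stated assumptions: the function names, the tokenisation rule, the thresholds `top_n` and `rare_max`, and the sample messages are all hypothetical and are not taken from the article or its datasets.

```python
import re
from collections import Counter


def word_frequencies(messages):
    """Count lower-cased word tokens across an iterable of messages."""
    counts = Counter()
    for message in messages:
        # Assumed tokenisation: runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", message.lower()))
    return counts


def frequency_candidates(messages, top_n=50, rare_max=1):
    """Return (stop-word candidates, typo candidates) for manual review.

    top_n and rare_max are illustrative thresholds, not values from the article.
    """
    counts = word_frequencies(messages)
    # Most frequent words: candidates for the stop-word list.
    stopword_candidates = [word for word, _ in counts.most_common(top_n)]
    # Least frequent words: candidates for the typo list.
    typo_candidates = sorted(w for w, c in counts.items() if c <= rare_max)
    return stopword_candidates, typo_candidates


# Hypothetical sample messages, invented for illustration only.
sample_messages = [
    "na mimi na wewe",
    "wewe na yeye",
    "mimi ni mwanafunzi",
    "yeye ni mwalimu na mimi ni mwanafunzi",
]

stop_candidates, typo_candidates = frequency_candidates(
    sample_messages, top_n=1, rare_max=1
)
```

Both lists are only candidates. As the description notes, the authors reviewed the frequent words and analysed the rare ones before building the datasets, so a manual review step would still follow a script like this. In a real corpus, rare words also include legitimate but uncommon vocabulary, not just typos.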