Author: |
Bernard Masua, Noel Masasi |
Language: |
English |
Year of publication: |
2020 |
Subject: |
|
Source: |
Data in Brief, Vol 33, Iss , Pp 106517- (2020) |
Document type: |
article |
ISSN: |
2352-3409 |
DOI: |
10.1016/j.dib.2020.106517 |
Description: |
Natural Language Processing (NLP) requires data to be pre-processed to ensure quality models across machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially the basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang, and common Swahili typos. The main source for these datasets was short Swahili messages collected from a Tanzanian platform that young people use to convey their opinions on things that matter to them. We derive the list of common Swahili stop-words by reviewing the most frequent words generated from our corpus with a Python script, review common slang terms and their corresponding proper words with the help of Swahili experts, and generate common Swahili typos by analysing the least frequent words generated from the corpus with a Python script. The datasets were exported into files for easy access and reuse. They can be reused in natural language processing as resources in the pre-processing phase for Swahili textual data. |
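The frequency-based step described in the abstract can be illustrated with a minimal sketch. This is not the authors' script, which the record does not reproduce. The function names, the tokenization rule, the `top_n` and `max_count` thresholds, and the four sample messages are all assumptions made up for illustration. The sketch only produces candidate lists; in the article, the final lists come from reviewing such candidates.

```python
import re
from collections import Counter


def tokenize(message):
    """Lowercase a message and return runs of letters and apostrophes.

    Assumption: a simple ASCII rule is enough for the sketch; a real
    corpus may need a more careful tokenizer.
    """
    return re.findall(r"[a-z']+", message.lower())


def word_frequencies(messages):
    """Count how often each token occurs across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def stop_word_candidates(counts, top_n):
    """Return the top_n most frequent words, for manual review as stop-words."""
    return [word for word, _ in counts.most_common(top_n)]


def typo_candidates(counts, max_count=1):
    """Return words seen at most max_count times, for manual review as typos."""
    return sorted(word for word, count in counts.items() if count <= max_count)


# Hypothetical sample messages, invented for this sketch (not from the dataset).
messages = [
    "mimi na wewe",
    "wewe na yeye",
    "yeye na mimi na sisi",
    "habri za leo",
]

counts = word_frequencies(messages)
print(stop_word_candidates(counts, top_n=1))
print(typo_candidates(counts, max_count=1))
```

Rare words are not necessarily typos and frequent words are not necessarily stop-words, so both outputs are starting points for human review, consistent with the reviewing and analysing steps the abstract mentions.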
Database: |
Directory of Open Access Journals |