Effective multi-dialectal Arabic POS tagging
Author: | Lluís Màrquez, Hamdy Mubarak, Ahmed Abdelali, Mohammed Attia, Laura Kallmeyer, Mohamed Eldesouki, Younes Samih, Kareem Darwish |
---|---|
Year of publication: | 2020 |
Subject: |
Conditional random field; 050101 languages & linguistics; Linguistics and Language; Sequence; Artificial neural network; business.industry; Computer science; 05 social sciences; 02 engineering and technology; Variety (linguistics); computer.software_genre; Language and Linguistics; Data set; Artificial Intelligence; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; 0501 psychology and cognitive sciences; Artificial intelligence; Layer (object-oriented design); CRFS; business; computer; Software; Natural language processing; Word (computer architecture) |
Source: | Natural Language Engineering. 26:677-690 |
ISSN: | 1469-8110 1351-3249 |
Description: | This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers of convolutional and recurrent networks and a CRF output layer. We successfully exploit a variety of features that help generalize our models, such as Brown clusters and stem templates. We also develop robust joint models that tag multi-dialectal tweets and outperform uni-dialectal taggers. We achieve a combined accuracy of 92.4% across all dialects, with per-dialect results ranging from 90.2% to 95.4%. We obtained the results using a train/dev/test split of 70/10/20 for a data set of 350 tweets per dialect. |
Database: | OpenAIRE |
External link: |
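
The description gives a 70/10/20 train/dev/test split over 350 tweets for each of four dialects. As a rough illustration of what that split amounts to, the following sketch computes the partition sizes per dialect. The function name `split_indices`, the shuffling step, the seed, and the decision to split each dialect separately are all assumptions made for illustration; the record does not say how the authors partitioned the data.

```python
import random

# Dialect groups and data set size as stated in the description.
DIALECTS = ["Egyptian", "Levantine", "Gulf", "Maghrebi"]
TWEETS_PER_DIALECT = 350


def split_indices(n, train_pct=70, dev_pct=10, seed=0):
    """Shuffle range(n) and cut it into train/dev/test index lists.

    Hypothetical helper: shuffling and the seed are illustrative
    assumptions, not details taken from the paper.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * train_pct // 100
    n_dev = n * dev_pct // 100
    train = idx[:n_train]
    dev = idx[n_train:n_train + n_dev]
    test = idx[n_train + n_dev:]  # remaining 20%
    return train, dev, test


# Assumption: each dialect is split independently.
splits = {d: split_indices(TWEETS_PER_DIALECT) for d in DIALECTS}
sizes = {d: tuple(len(part) for part in splits[d]) for d in DIALECTS}
total_tweets = sum(sum(s) for s in sizes.values())

for dialect, (n_train, n_dev, n_test) in sizes.items():
    print(f"{dialect}: train={n_train} dev={n_dev} test={n_test}")
```

Under these assumptions each dialect contributes 245 training, 35 development, and 70 test tweets, for 1,400 tweets in total.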