Ukrainian Texts Classification: Exploration of Cross-lingual Knowledge Transfer Approaches

Autor: Dementieva, Daryna, Khylenko, Valeriia, Groh, Georg
Rok vydání: 2024
Předmět:
Druh dokumentu: Working Paper
Popis: Despite the extensive amount of labeled datasets in the NLP text classification field, the persistent imbalance in data availability across various languages remains evident. Ukrainian, in particular, stands as a language that still can benefit from the continued refinement of cross-lingual methodologies. Due to our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage the state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods avoiding manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing the "recipe" for the optimal setups.
Databáze: arXiv