ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations

Author:	Rania Al-Sabbagh
Language:	English
Year of publication:	2024
Subject:	Parallel datasets; Arabic dialects; Benchmarking datasets; Finetuning large-language models; Machine translation; Translation studies; Computer applications to medicine. Medical informatics (R858-859.7); Science (General) (Q1-390)
Source:	Data in Brief, Vol 54, Pp 110271- (2024)
Document type:	article
ISSN:	2352-3409
DOI:	10.1016/j.dib.2024.110271
Description:	ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that are manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate. Additionally, the dataset is a valuable resource for research in various disciplines, including translation studies, cross-linguistic analysis, and lexical semantics. The dataset can also serve pedagogical purposes in training translation students and can aid professional translators as a translation memory. The contributions are twofold: first, the dataset features textual genres not found in existing parallel Egyptian Arabic and English datasets, and second, it is a gold-standard dataset that has been translated and aligned by human experts.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/de54fb20a2e44c0fb11f1b5e8aba7146