Arabic punctuation dataset.

Authors: Yagi S (Department of Foreign Languages, University of Sharjah, United Arab Emirates); Elnagar A (Department of Computer Science, University of Sharjah, United Arab Emirates); Yaghi E (Department of Linguistics, University of Waikato, Hamilton, New Zealand).
Language: English
Source: Data in brief [Data Brief] 2024 Feb 01; Vol. 53, pp. 110118. Date of Electronic Publication: 2024 Feb 01 (Print Publication: 2024).
DOI: 10.1016/j.dib.2024.110118
Abstract: Unlike many languages, Arabic is punctuated inconsistently, which poses a significant obstacle for Natural Language Processing (NLP). To address this, we present the Arabic Punctuation Dataset (APD), a large annotated collection of Modern Standard Arabic (MSA) texts designed to train machine learning models for sentence boundary identification and punctuation prediction. APD builds on the "theme-rheme completion" principle, a grammatical feature closely linked to consistent punctuation placement. The dataset contains 312 million words in approximately 12 million sentences and comprises three components: (1) Arabic Book Chapters (ABC): manually annotated non-fiction book excerpts that constitute a gold-standard reference; (2) Complete Book Translations (CBT): parallel English-Arabic book translations with aligned sentence endings, suited to machine translation training; (3) Scrambled Sentences from the Arabic Component of the United Nations Parallel Corpus (SSAC-UNPC): jumbled sentences for training models in automatic punctuation restoration. Beyond NLP, APD serves as a resource for linguistics research, language learning, and real-time subtitling. Its authentic, grammar-based approach can enhance the readability and clarity of machine-generated text in applications such as automatic speech recognition, text summarization, and machine translation.
(© 2024 The Author(s).)
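Illustrative note (not part of the published record): the punctuation prediction task named in the abstract is often framed as sequence labelling, where each word is tagged with the punctuation mark that follows it. The sketch below shows one way punctuated text could be turned into such training pairs. The function name `to_token_labels`, the label `"O"`, and the chosen set of marks are assumptions for illustration; they are not taken from APD or its tooling.

```python
# Hypothetical helper for illustration only; not part of APD.
# Marks considered: Arabic comma, semicolon, question mark, plus shared marks.
PUNCT = "،؛؟.!:"

def to_token_labels(text):
    """Turn punctuated text into (word, label) pairs.

    The label is the first punctuation mark that follows the word,
    or "O" when no mark follows it.
    """
    pairs = []
    for raw in text.split():
        word = raw.rstrip(PUNCT)        # strip trailing punctuation marks
        trailing = raw[len(word):]      # whatever was stripped
        if not word:
            # Stand-alone punctuation token: attach it to the previous word
            # if that word has no label yet.
            if pairs and pairs[-1][1] == "O":
                pairs[-1] = (pairs[-1][0], trailing[0])
            continue
        pairs.append((word, trailing[0] if trailing else "O"))
    return pairs

if __name__ == "__main__":
    sample = "ذهب الولد إلى المدرسة، ثم عاد."
    for word, label in to_token_labels(sample):
        print(word, label)
```

A model trained on pairs like these sees only the words and learns to predict the labels. A label that marks the end of a sentence doubles as a sentence boundary signal.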
Database: MEDLINE