Description: |
Preprint. For the published paper, see: Chahan Vidal-Gorène et al., « RASAM – A Dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi », in: Elisa H. Barney Smith, Umapada Pal (eds.), Document Analysis and Recognition – ICDAR 2021 Workshops, Lecture Notes in Computer Science 12916, Springer, 2021, pp. 265-281. Arabic scripts raise numerous issues in text recognition and layout analysis. To overcome these, several datasets and methods have been proposed in recent years. Although these focus on common scripts and layouts, many Arabic writings and written traditions remain under-resourced. We therefore propose a new dataset comprising 300 images representative of the handwritten production of the Arabic Maghrebi scripts. This dataset is the result of collaborative work undertaken in the first quarter of 2021, and it offers several levels of annotation and transcription. The article intends to shed light on the specificities of these writings and manuscripts, as well as to highlight the challenges of their recognition. The collaborative tools used to create the dataset are assessed, and the dataset itself is evaluated with state-of-the-art methods in layout analysis. The word-based text recognition method tested on these writings achieves a character error rate (CER) of 4.8% on average. The pipeline described provides practical feedback on the rapid creation of data and the training of effective HTR systems for Arabic scripts and non-Latin scripts in general. |