Books of Hours: the First Liturgical Corpus for Text Segmentation

Authors: Amir Hazem, Béatrice Daille, Marie-Laurence Bonhomme, Martin Maarand, Mélodie Boillet, Christopher Kermorvant, Dominique Stutzmann
Contributors: Traitement Automatique du Langage Naturel (TALN), Laboratoire des Sciences du Numérique de Nantes (LS2N), Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST), Université de Nantes (UN)-Université de Nantes (UN)-École Centrale de Nantes (ECN)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT), Teklia, Teklia (Teklia), A2iA (A2iA), A2iA, Institut de recherche et d'histoire des textes (IRHT), Centre National de la Recherche Scientifique (CNRS), ANR-17-CE38-0008, HORAE, Heures : Reconnaissance de l'écriture manuscrite, catégorisation automatique, éditions (2017), Centre National de la Recherche Scientifique (CNRS)-École Centrale de Nantes (ECN)-Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST), Université de Nantes (UN)-Université de Nantes (UN)-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Centre National de la Recherche Scientifique (CNRS)-École Centrale de Nantes (ECN)-Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST)
Language: English
Publication year: 2020
Subject:
Source: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)
12th Language Resources and Evaluation Conference, May 2020, Marseille (Virtual), France. pp.776-784
HAL
Description: The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has been scarcely studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, its often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words, and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) its hierarchical entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, its textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, which consists of Latin transcriptions of 300 books of hours generated by Handwritten Text Recognition (HTR), which is like Optical Character Recognition (OCR) but for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.
Database: OpenAIRE