From Digitization and Images to Text and Content: Transkribus as a Case Study.

Author: Prebor, Gila
Subject:
Source: Manuscript Studies: A Journal of the Schoenberg Institute for Manuscript Studies; Spring 2024, Vol. 9, Issue 1, pp. 72-89 (18 pp.)
Abstract: Over the last decades, libraries and archives have been increasingly investing in the digitization of their collections, including manuscripts, rare books, newspapers, archival material, and more. Many of these resources are freely accessible. However, the material accessible consists only of the metadata of the resources along with their images. The textual content of the resulting digital images is not yet visible and those seeking to find the content of the resources must study and transcribe individual passages. This article has demonstrated the immense potential of technological tools in the transcription of Hebrew manuscripts. Through our analysis, we have shown that handwriting recognition models trained with Transkribus can generate usable results when applied to Hebrew Sephardic semi-cursive manuscripts from the 15th century. This marks a significant advancement in the field, as it allows for a more efficient and cost-effective approach to transcription. Our findings highlight that even with a relatively small investment, remarkable results can be achieved. The recommended amount of ground truth data for training a Transkribus model is set at approximately 15,000 transcribed words or 75 pages to recognize text written by a single hand. Adhering to the principles of machine learning, the submission of a larger volume of ground truth data enhances the accuracy of the transcription results. However, our trials have shown that even with a smaller amount of data, it is still possible to attain good outcomes. This is a promising prospect, as it facilitates the mass digitization of previously unpublished manuscripts, opening up vast opportunities for future research endeavors. [ABSTRACT FROM AUTHOR]
Database: Complementary Index