Abstrakt: |
New technologies for information seeking are based on machine learning techniques, such as statistical or deep learning approaches, that require substantial computational resources as well as very large corpora to develop the applications that, in this sub-area of Artificial Intelligence, are the so-called models. Nowadays, the reusability of the developed models is approached through fine-tuning and transfer learning techniques. When the available corpus is written in a language or domain with scarce resources, the accuracy of these approaches decreases, so it is important to begin the task with state-of-the-art techniques. This is the main problem tackled in the work presented here, which stems from art historians' interest in an image-based digitized collection of a newspaper called Diario de Madrid (DM), from the Spanish press of the 18th and 19th centuries, which is freely available at the Spanish National Library (BNE). Their focus is on information related to entities such as historical persons, locations, objects for sale or lost, and others, in order to obtain geo-localization visualizations and solve some historical riddles. The first technical step is to obtain transcriptions of the original digitized newspapers from the DM (1788-1825) collection. The second step is the development of a Named Entity Recognition (NER) model to automatically annotate the available corpus with the entities of interest for their research. To this end, once the CLARA-DM corpus has been created, a sub-corpus must be manually annotated, with the help of selected computational tools, to provide the training data that current Natural Language Processing (NLP) techniques require. To develop the necessary annotation model (CLARA-AM), an experimentation step is carried out with state-of-the-art Deep Learning (DL) models and an already available corpus, which complements the corpus that we have developed. 
A main contribution of the paper is the methodology developed to tackle problems similar to that of the art historians' digitized corpus: selecting specific tools when available, reusing existing DL models to carry out new experiments on an available corpus, reproducing the experiments on the art historians' own corpus, and applying transfer learning techniques within a domain with few resources. Four resources developed in this work are described: the transcribed corpus, the DL-based transcription model, the annotated corpus, and the DL models developed for annotation with a domain-specific set of labels on a small corpus. The CLARA-TM transcription model trained for the DM has been accessible since January 2023 at the READ-COOP website under the title "Spanish print XVIII-XIX - Free Public AI Model for Text Recognition with Transkribus" (https://readcoop.eu/model/spanish-print-xviii-xix/). [ABSTRACT FROM AUTHOR] |