Creating Specialized Corpora from Digitized Historical Newspaper Archives

Autor:	Joshua Wilson Black
Rok vydání:	2022
Předmět:	Linguistics and Language Language and Linguistics Computer Science Applications Information Systems
DOI:	10.17605/osf.io/7crgt
Popis:	The availability of large digital archives of historical newspaper content has transformed the historical sciences. However, the scale of these archives can limit the direct application of advanced text processing methods. Even if it is computationally feasible to apply sophisticated language processing to an entire digital archive, if the material of interest is a small fraction of the archive, the results are unlikely to be useful. Methods for generating smaller specialized corpora from large archives are required to solve this problem. This article presents such a method for historical newspaper archives digitized using the METS/ALTO XML standard (Veridian Software, n.d.). The method is an ‘iterative bootstrapping’ approach in which candidate corpora are evaluated using text mining techniques, items are manually labelled, and Naïve Bayes text classifiers are trained and applied in order to produce new candidate corpora. The method is illustrated by a case study that investigates philosophical content, broadly construed, in pre-1900 English-language New Zealand newspapers. Extensive code is provided in Supplementary Materials.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::76d9428c314d41fb6a6606b375557741 Zobrazit plný text záznamu