Adventures in Correcting XML Collation Problems with Python and XSLT

Author: Elisa E. Beshero-Bondar
Publication year: 2022
Source: Balisage Series on Markup Technologies.
ISSN: 1947-2609
Description: The process of instructing a computer to compare texts, known as computer-aided collation, might resemble trying to fix a power loom when the threads it is supposed to weave together become tangled. The automated weaving continues with the threads improperly aligned and the pattern broken, in a way that can make it difficult to isolate the cause of the problem. Automating a tedious process magnifies the complexity of error correction, sometimes calling for new tooling to help us perfect the weaving or collating process. The authors are attempting to refine a collation algorithm to improve its alignment of variant passages in the Frankenstein Variorum project. We began with a Python script that tokenizes and normalizes the texts of the editions and delivers them to collateX, which processes the collation and delivers TEI-conformant output for our project. In post-processing stages after running the collation, we apply a series of XSLT transformations to the collation output. This post-collation XSLT pipeline publishes the digital variorum edition, preparing each output witness in TEI XML to store information about its own variance from the other editions. We have discussed that pipeline elsewhere; our interest in this paper is in efforts to repair and improve the collation process itself. We have applied Schematron and XSLT in post-processing to correct patterns of erroneous alignments, but eventually realized that the problems we were trying to solve required repairing the collation algorithm. We are now experimenting with revising the collation algorithm in two ways: 1) by fine-tuning the text preparation algorithms we apply in the Python file that delivers text to the collateX software, and 2) by attempting to implement those same text preparation algorithms entirely in XSLT, using the Text Alignment Network's XSLT functions tan:diff() and tan:collate(), introduced by Joel Kalvesmaki at the 2021 Balisage conference.
In this paper we discuss the challenges of figuring out where and how to intervene in the collation process, and what we are learning about how far we can take XSLT and Schematron in helping to automate the preparation, collation, and correction process.
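The text preparation step described above (tokenize, normalize, then hand the witnesses to the collation software) can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names, the sample witness sigla, and the normalization rules (lowercasing and treating "&" as "and") are all assumptions made for the example. The output follows the pre-tokenized JSON shape that CollateX documents, where each token carries a `t` (original reading) and an optional `n` (normalized form used for alignment).

```python
from __future__ import annotations

import json
import re


def normalize(token: str) -> str:
    """Hypothetical normalization rule: trim whitespace, lowercase,
    and treat an ampersand as the word 'and'."""
    text = token.strip().lower()
    return "and" if text == "&" else text


def tokenize(text: str) -> list[dict]:
    """Split on whitespace, keeping trailing whitespace with each token,
    and pair every original reading ('t') with its normalized form ('n')."""
    return [{"t": raw, "n": normalize(raw)} for raw in re.findall(r"\S+\s*", text)]


def build_collation_input(witnesses: dict[str, str]) -> dict:
    """Assemble a pre-tokenized witness structure (assumed input shape)."""
    return {
        "witnesses": [
            {"id": siglum, "tokens": tokenize(text)}
            for siglum, text in witnesses.items()
        ]
    }


# Invented sample readings, for illustration only.
sample = {"w1": "The monster & I", "w2": "the monster and I"}
collation_input = build_collation_input(sample)
print(json.dumps(collation_input, indent=2))
```

Because alignment would compare the `n` values rather than the `t` values, the two sample witnesses normalize to identical token sequences even though their original readings differ. Adjusting the rules in `normalize()` is the kind of fine-tuning that changes what the collation treats as a variant.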
Database: OpenAIRE