Protocol: Investigating DOIs classes of errors v5

Authors: Arcangelo Massari, Deniz Tural, Ricarda Boente, Cristian Santini
Publication year: 2021
DOI: 10.17504/protocols.io.buuknwuw
Description: The purpose of this protocol is to provide an automated process for repairing invalid DOIs collected by the OpenCitations Index Of Crossref Open DOI-To-DOI References (COCI) while processing data provided by Crossref. The data needed for this work is provided by Silvio Peroni as a CSV containing pairs of valid citing DOIs and invalid cited DOIs. To design such a process, we first classified the errors that characterize the wrong DOIs in the list. The starting hypothesis is that there are two main classes of errors: factual errors, such as wrong characters, and DOIs that were not yet valid at the time of processing. The first class can be further divided into three subclasses: errors due to irrelevant strings added at the beginning of the correct DOI (prefix-type errors) or at its end (suffix-type errors), and errors due to unwanted characters in the middle (other-type errors). Once the classes of errors are established, we propose automatic processes to obtain correct DOIs from wrong ones. These processes use the information returned by the DOI API and the January 2021 Public Data File from Crossref, together with rule-based methods, including regular expressions, to correct invalid DOIs. Applying this methodology produced a CSV dataset containing all the pairs of citing and cited DOIs in the original dataset, each enriched with five fields: "Already_Valid", which tells whether the cited DOI was already valid before cleaning; "New_DOI", which contains a clean, valid DOI (if our procedure was able to produce one); and "prefix_error", "suffix_error" and "other-type_error", which contain, for each cleaned DOI, the number of errors of each type that were corrected.
Database: OpenAIRE
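
Illustrative sketch: the rule-based cleaning step mentioned in the description (removing prefix-type, suffix-type and other-type errors with regular expressions, and counting the corrections per class) could look roughly like the Python below. This is a minimal sketch and not the protocol's actual code: the three patterns and the function name `clean_doi` are assumptions chosen for illustration, since the record does not list the real rules. Validation against the DOI API or the Crossref Public Data File is left out.

```python
import re

# Hypothetical patterns for illustration only; the protocol's real rules
# are not given in this record.
PREFIX_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
SUFFIX_RE = re.compile(r"(?:[.,;:]+|/(?:abstract|full|pdf))$", re.IGNORECASE)
OTHER_RE = re.compile(r"\s+|\\")  # whitespace runs or stray backslashes


def strip_repeatedly(pattern, text):
    """Remove matches of an anchored pattern until none is left; return (text, count)."""
    total = 0
    while True:
        text, n = pattern.subn("", text)
        if n == 0:
            return text, total
        total += n


def clean_doi(raw):
    """Return (cleaned_doi, counts) where counts maps each error field to a number."""
    doi = raw.strip()
    doi, prefix_n = strip_repeatedly(PREFIX_RE, doi)
    doi, suffix_n = strip_repeatedly(SUFFIX_RE, doi)
    doi, other_n = OTHER_RE.subn("", doi)
    counts = {
        "prefix_error": prefix_n,
        "suffix_error": suffix_n,
        "other-type_error": other_n,
    }
    return doi, counts


if __name__ == "__main__":
    for candidate in ["https://doi.org/10.1000/xyz123.", "doi: 10.1000/ab c/pdf"]:
        print(candidate, "->", clean_doi(candidate))
```

In a full pipeline, each cleaned string would still have to be checked for validity before being written to the "New_DOI" field; the sketch only covers the string-level repair and the per-class counts.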