Tucuxi-BLAST: Enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach.

Authors: Araujo JD; Department of Clinical and Toxicological Analyses, Universidade de São Paulo, São Paulo, SP, Brazil., Santos-E-Silva JC; Department of Clinical and Toxicological Analyses, Universidade de São Paulo, São Paulo, SP, Brazil., Costa-Martins AG; Department of Clinical and Toxicological Analyses, Universidade de São Paulo, São Paulo, SP, Brazil.; Scientific Platform Pasteur USP, São Paulo, SP, Brazil., Sampaio V; Fundação de Medicina Tropical Dr. Heitor Vieira Dourado, Manaus, Brazil.; Instituto Todos pela Saúde, São Paulo, SP, Brazil., de Castro DB; Fundação de Vigilância em Saúde do Amazonas, Manaus, Brazil., de Souza RF; Departamento de Microbiologia, Universidade de São Paulo, São Paulo, Brazil., Giddaluru J; Department of Clinical and Toxicological Analyses, Universidade de São Paulo, São Paulo, SP, Brazil., Ramos PIP; Oswaldo Cruz Foundation, Salvador, Brazil., Pita R; Oswaldo Cruz Foundation, Salvador, Brazil., Barreto ML; Oswaldo Cruz Foundation, Salvador, Brazil., Barral-Netto M; Oswaldo Cruz Foundation, Salvador, Brazil., Nakaya HI; Department of Clinical and Toxicological Analyses, Universidade de São Paulo, São Paulo, SP, Brazil.; Scientific Platform Pasteur USP, São Paulo, SP, Brazil.; Instituto Todos pela Saúde, São Paulo, SP, Brazil.; Hospital Israelita Albert Einstein, São Paulo, SP, Brazil.
Language: English
Source: PeerJ [PeerJ] 2022 Jul 11; Vol. 10, pp. e13507. Date of Electronic Publication: 2022 Jul 11 (Print Publication: 2022).
DOI: 10.7717/peerj.13507
Abstract: Background: Public health research frequently requires the integration of information from different data sources. However, errors in the records and the high computational costs involved make linking large administrative databases using record linkage (RL) methodologies a major challenge.
Methods: We present Tucuxi-BLAST, a versatile tool for probabilistic RL that utilizes a DNA-encoded approach to encrypt, analyze and link massive administrative databases. Tucuxi-BLAST encodes the identification records into DNA. The BLASTn algorithm is then used to align the sequences between databases. We tested and benchmarked the tool on a simulated database containing records for 300 million individuals and also on four large administrative databases containing real data on Brazilian patients.
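The encoding step described in the Methods can be illustrated with a minimal sketch. The alphabet, the three-nucleotide codon table, and the function names below are assumptions made for illustration only; the record does not specify the encoding scheme that Tucuxi-BLAST actually uses. The sketch maps each character of an identification record to a fixed codon and writes the result in FASTA format, which is the kind of input a nucleotide aligner such as BLASTn accepts.

```python
from itertools import product

# Hypothetical alphabet (not taken from the paper): uppercase letters,
# digits, and space. 37 symbols fit within the 64 possible 3-base codons.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CHAR_TO_CODON = dict(zip(ALPHABET, CODONS))
CODON_TO_CHAR = {codon: char for char, codon in CHAR_TO_CODON.items()}


def encode_record(fields):
    """Join identification fields and translate each character to a codon.

    Characters outside the hypothetical alphabet are skipped.
    """
    text = " ".join(fields).upper()
    return "".join(CHAR_TO_CODON[c] for c in text if c in CHAR_TO_CODON)


def decode_sequence(seq):
    """Inverse of encode_record for sequences produced by this sketch."""
    return "".join(CODON_TO_CHAR[seq[i:i + 3]] for i in range(0, len(seq), 3))


def to_fasta(records):
    """Render {record_id: fields} as FASTA text, one sequence per record."""
    lines = []
    for rec_id, fields in records.items():
        lines.append(f">{rec_id}")
        lines.append(encode_record(fields))
    return "\n".join(lines) + "\n"


def mismatches(a, b):
    """Count differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))
```

Under this fixed-width scheme, a single mistyped character changes at most three nucleotides, so two records that differ by a typo remain highly similar sequences; this is the property that lets a sequence aligner tolerate misspellings.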
Results: Our method was able to overcome misspellings and typographical errors in administrative databases. On the largest simulated dataset (200k records), the state-of-the-art method took 5 days and 7 h to perform the RL, while Tucuxi-BLAST took only 23 h. When compared with five existing RL tools applied to a gold-standard dataset from real health-related databases, Tucuxi-BLAST had the highest accuracy and speed. By repurposing genomic tools, Tucuxi-BLAST can improve data-driven medical research and provide a fast and accurate way to link individual information across several administrative databases.
Competing Interests: Helder I. Nakaya and Robson Souza are Academic Editors for PeerJ.
(© 2022 Araujo et al.)
Database: MEDLINE