Popis: |
This thesis presents a data preparation methodology that uses probabilistic record linkage and information fusion to improve and enrich information visualizations of biomedical citation data. It investigates the problem of linking records across citation databases when the only common identifiers are non-unique ones such as author names and document titles. This problem parallels record linkage problems in clinical data, so Knowledge Discovery in Databases (KDD) methods from clinical data mining were evaluated. Probabilistic and deterministic (exact-match) record linkage models were developed and compared against a gold standard (truth) dataset. Empirical comparison of the record linkage models with ROC analysis showed that the probabilistic model performed significantly better than the deterministic models (p < .001). The methodology was evaluated with probabilistic linkage of records from the Web of Science, Medline, and CINAHL citation databases in the knowledge domains of medical informatics, HIV/AIDS, and nursing informatics. Data quality metrics for datasets prepared with probabilistic record linkage and information fusion showed improved completeness of key variables and reduced sample bias. The resulting visualizations offered users a richer information space because more terms entered the visualization. The significant contributions of this work include the development of a novel model of probabilistic record linkage for biomedical citation databases that improves upon existing deterministic models. In addition, a methodology for improving and enriching knowledge domain visualizations through a data preparation approach has been validated with analyses of multiple citation databases and knowledge domains.
The data preparation methodology of probabilistic record linkage with information fusion offers a remedy for data quality problems, and the opportunity to enrich visualizations with added content for user exploration, which in turn improves the utility of knowledge domain visualizations as a medium for assessing available evidence and forming hypotheses. |