Understanding the influence of individual variables contributing to multivariate outliers in assessments of data quality
Author: | Jianfeng Ding, Laura Castro-Schilo, Richard C. Zink |
---|---|
Year of publication: | 2018 |
Subject: |
Statistics and Probability
General clinical medicine; Multivariate statistics; Computer science; Machine learning; Natural sciences; Statistics & probability; Medical and health sciences; Clinical medicine; Data visualization; Covariate; Humans; Pharmacology (medical); Mathematics; Pharmacology; Clinical Trials as Topic; Mahalanobis distance; Data Accuracy; Identification (information); Data quality; Principal component analysis; Outlier; Artificial intelligence |
Source: | Pharmaceutical Statistics. 17:846-853 |
ISSN: | 1539-1604 |
DOI: | 10.1002/pst.1903 |
Description: | Mahalanobis distance is often recommended to identify patients or clinical sites that are considered unusual in clinical trials. Patients extreme in one or more covariates may be considered outliers in that they reside some distance from the multivariate mean, which can be thought of as the center of the data cloud. Less often discussed, patients whose data are believed to be "too good to be true" are located near the centroid as inliers. To investigate these anomalies efficiently for potential lapses in data quality, it is important to understand how the individual variables contribute to each multivariate outlier. There is a lack of literature describing a reasonable workflow for identifying outliers and then investigating how each variable contributes to an observation being considered extreme. We describe how to identify multivariate inliers and outliers, classify outliers according to varying levels of severity, and summarize the contributions of variables using principal components in a manner that is accessible to a wide audience with straightforward interpretation. We illustrate how numerous data visualizations, including Pareto plots, can facilitate further review even in studies containing numerous observations and variables. We illustrate these methodologies using data from a multicenter clinical trial. |
Database: | OpenAIRE |
External link: |
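
The description above mentions Mahalanobis distance, inliers and outliers, and a summary of contributions based on principal components. As a minimal sketch only (not the authors' implementation), the snippet below uses the standard identity that the squared Mahalanobis distance equals the sum of squared principal component scores divided by their eigenvalues, so each term can be read as one component's share of the distance. The function name `mahalanobis_pc_contributions`, the simulated data, and the empirical 1% and 99% cutoffs are all assumptions made for illustration; the record does not state the cutoffs or the exact variable-level summary used in the paper.

```python
import numpy as np


def mahalanobis_pc_contributions(X):
    """Return squared Mahalanobis distances and per-component contributions.

    Hypothetical helper for illustration: contributions[i, k] is the share of
    observation i's squared distance that comes from principal component k.
    Assumes the sample covariance matrix is full rank.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)           # distance is measured from the centroid
    cov = np.cov(centered, rowvar=False)    # sample covariance (ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition of a symmetric matrix
    scores = centered @ eigvecs             # principal component scores
    contributions = scores ** 2 / eigvals   # squared standardized scores
    d2 = contributions.sum(axis=1)          # squared Mahalanobis distance
    return d2, contributions


# Simulated data standing in for patient covariates (assumption, not trial data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[0] += 6.0  # push one observation far from the centroid

d2, contributions = mahalanobis_pc_contributions(X)

# Illustrative empirical cutoffs (assumption): largest 1% flagged as outliers,
# smallest 1% flagged as inliers near the centroid.
outliers = np.flatnonzero(d2 > np.quantile(d2, 0.99))
inliers = np.flatnonzero(d2 < np.quantile(d2, 0.01))

# Rank components by their contribution for the most extreme observation,
# which is the ordering a Pareto-style plot would display.
worst = int(np.argmax(d2))
ranked_components = np.argsort(contributions[worst])[::-1]
```

Because the contributions in each row sum exactly to that observation's squared distance, they can be sorted and plotted as a Pareto chart without any rescaling. Mapping component-level shares back to the original variables requires an additional step (for example, inspecting the loadings in `eigvecs`), which this sketch leaves out.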