Identification of sample annotation errors in gene expression datasets

Authors:	Karolina Edlund, Jan G. Hengstler, Marcus Schmidt, Miriam Lohr, Birte Hellwig, Jörg Rahnenführer, Patrick Micke, Johan Botling, Johanna Sofia Margareta Mattsson
Publication year:	2015
Subjects:	Quality control; Health, Toxicology and Mutagenesis; Pharmacology and Toxicology; Computational biology; Microarray; Biology; Toxicology; Bioinformatics; Polymorphism, Single Nucleotide; Statistical power; Male-female classifier; Gene expression; Misannotation; Annotation; Female patient; Humans; Statistical analysis; General Medicine; Toxicogenomics; Gene expression profiling; Correlation analysis; Classifier (UML); Genome-Wide Association Study
Source:	Archives of Toxicology, 89(12): 2265-2272
ISSN:	1432-0738, 0340-5761
DOI:	10.1007/s00204-015-1632-4
Description:	The comprehensive transcriptomic analysis of clinically annotated human tissue has found widespread use in oncology, cell biology, immunology, and toxicology. In cancer research, microarray-based gene expression profiling has been applied successfully to subclassify disease entities, predict therapy response, and identify cellular mechanisms. Public accessibility of raw data, together with the corresponding clinicopathological parameters, offers the opportunity to reuse previously analyzed data and to gain statistical power by combining multiple datasets. However, results and conclusions depend on the reliability of the available annotation. Here, we propose gene expression-based methods for identifying sample misannotations in public transcriptomic datasets. Sample mix-ups can be detected by a classifier that differentiates between samples from male and female patients. Correlation analysis identifies multiple measurements of material from the same sample. The analysis of 45 datasets (including 4913 patients) revealed that erroneous sample annotation, affecting 40 % of the analyzed datasets, may be more widespread than previously thought. Removal of erroneously labelled samples may influence the results of the statistical evaluation in some datasets. Our methods may help to identify individual datasets that contain numerous discrepancies and could be routinely included in the statistical analysis of clinical gene expression data. Electronic supplementary material: the online version of this article (doi:10.1007/s00204-015-1632-4) contains supplementary material, which is available to authorized users.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::5462dce3a3dd2b39f4738236eef24d3f https://doi.org/10.1007/s00204-015-1632-4
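
Illustrative note: the abstract in this record names two checks, a male-female classifier and a correlation analysis, but does not give their details. The Python sketch below is only a rough illustration of the general idea, not the authors' method. The marker genes (XIST as a female-expressed marker, RPS4Y1 as a Y-linked marker), the simple "higher marker wins" rule, the correlation threshold of 0.99, and all sample names and values are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def find_duplicate_pairs(profiles, threshold=0.99):
    """Return sample pairs whose profiles correlate above `threshold`.

    profiles: dict mapping sample id -> list of expression values,
    with genes in the same order for every sample.
    The threshold is an illustrative assumption, not a published value.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) > threshold
    ]


def find_sex_mismatches(annotated_sex, female_marker, male_marker):
    """Return samples whose annotated sex disagrees with a marker rule.

    Hypothetical rule: predict "F" when the female marker (e.g. XIST)
    is expressed more highly than the male marker (e.g. RPS4Y1), else "M".
    """
    flagged = []
    for sample, sex in annotated_sex.items():
        predicted = "F" if female_marker[sample] > male_marker[sample] else "M"
        if predicted != sex:
            flagged.append(sample)
    return sorted(flagged)


# Toy data (invented for this example).
profiles = {
    "s1": [2.0, 5.0, 7.5, 1.0, 9.0],
    "s2": [2.1, 5.1, 7.4, 1.1, 9.1],  # near copy of s1
    "s3": [8.0, 1.0, 3.0, 6.5, 2.0],
}
annotated_sex = {"s1": "F", "s2": "M", "s3": "M"}
xist = {"s1": 9.2, "s2": 8.8, "s3": 3.1}
rps4y1 = {"s1": 3.0, "s2": 3.4, "s3": 10.1}

print(find_duplicate_pairs(profiles))
print(find_sex_mismatches(annotated_sex, xist, rps4y1))
```

In this toy input, s1 and s2 are near-identical profiles and s2 is annotated male despite a female-like marker pattern, so both checks point at the same suspicious sample. A real analysis would need thresholds and markers validated per platform and tissue.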