snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

Authors: Christina Vasilopoulou, William Duddy, Benjamin Wingfield, Andrew P. Morris
Language: English
Year of publication: 2021
Subject:
Quality Control
Population Stratification
GWAS pipeline
Computer science
Population
General Biochemistry, Genetics and Molecular Biology
03 medical and health sciences
0302 clinical medicine
Software
Genomic Variants
Humans
GWAS
Quality (business)
Quantitative Biology - Genomics
Imputation (statistics)
General Pharmacology, Toxicology and Pharmaceutics
QC
Imputation
030304 developmental biology
Genomics (q-bio.GN)
0303 health sciences
Genome
General Immunology and Microbiology
Software Tool Article
Reproducibility of Results
Genomics
Articles
General Medicine
Anaconda
Pipeline (software)
Nextflow
Workflow
FOS: Biological sciences
User control
Scalability
BioContainers
Data mining
030217 neurology & neurosurgery
SNPs
Source: F1000Research
Description: Quality control of genomic data is an essential but complicated multi-step procedure, often requiring separate installation of, and expert familiarity with, a combination of different bioinformatics tools. Software incompatibilities and inconsistencies across computing environments are recurrent challenges that lead to poor reproducibility. Existing semi-automated or automated solutions lack comprehensive quality checks, flexible workflow architecture, and user control. To address these challenges, we have developed snpQT: a scalable, stand-alone software pipeline using Nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data. snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports that show the state of the data before and after each quality control procedure. The steps include human genome build conversion, population stratification against data from the 1000 Genomes Project, automated removal of population outliers, and built-in imputation with its own pre- and post-imputation quality controls. Common input formats are used, and a synthetic dataset and comprehensive online tutorial are provided for testing, education, and demonstration. The snpQT pipeline is designed to run with minimal user input and coding experience; quality control steps are implemented with numerous user-modifiable thresholds, and workflows can be flexibly combined. snpQT is open source and freely available at https://github.com/nebfield/snpQT. A comprehensive online tutorial and installation guide, covering the workflow through to GWAS, is available at https://snpqt.readthedocs.io/en/latest/; it introduces snpQT using a synthetic demonstration dataset and a real-world Amyotrophic Lateral Sclerosis SNP-array dataset.
Database: OpenAIRE