TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies
Author: | Dafeng Shen, Yang-Jun Wen, Qingtai Wu, Jie Ding, Fengrong Liu, Yu Gao, Jiali Sun, Jin Zhang |
---|---|
Language: | English |
Publication year: | 2019 |
Subject: | Agricultural genetics; 0301 basic medicine; Quantitative Trait Loci; 0206 medical engineering; Arabidopsis; lcsh:Medicine; Genome-wide association study; Single-nucleotide polymorphism; 02 engineering and technology; Quantitative trait locus; Polymorphism, Single Nucleotide; Genome-wide association studies; Article; Machine Learning; 03 medical and health sciences; lcsh:Science; Mathematics; Genetic association; Multidisciplinary; Models, Genetic; business.industry; Least-angle regression; lcsh:R; Linear model; Pattern recognition; Genomics; Random forest; 030104 developmental biology; Multigene Family; Linear Models; Trait; lcsh:Q; Artificial intelligence; business; Genome, Plant; 020602 bioinformatics; Genome-Wide Association Study |
Source: | Scientific Reports, Vol 9, Iss 1, Pp 1-10 (2019) |
ISSN: | 2045-2322 |
DOI: | 10.1038/s41598-019-54519-x |
Description: | One of the most important tasks in genome-wide association studies (GWAS) is detecting single-nucleotide polymorphisms (SNPs) that are related to target traits. With the development of sequencing technology, traditional statistical methods struggle to analyze the resulting massive, high-dimensional SNP data. Recently, machine learning methods have become more popular in high-dimensional genetic data analysis because of their fast computation. However, most machine learning methods have several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification and low detection accuracy. This study proposes a two-stage algorithm based on least angle regression and random forest (TSLRF). It first controls for population structure and polygenic effects, then uses least angle regression (LARS) to select the SNPs potentially related to target traits, and finally analyzes this variable subset with random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method showed greater detection power in both simulation experiments and real data analyses. In the simulation experiments, compared with existing approaches, the new method improved QTN detection power and model fit, and required less computation time. In addition, the new method clearly distinguished QTNs from other SNPs. The new method was then applied to five flowering-related traits in Arabidopsis. The distinction between QTNs and unrelated SNPs was more pronounced than with the other methods. The new method detected 60 genes confirmed to be related to the target trait, significantly more than the other methods, and simultaneously detected multiple gene clusters associated with the target trait. |
Database: | OpenAIRE |
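The two-stage idea in the description (LARS screening followed by random forest ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes scikit-learn's `Lars` and `RandomForestRegressor` as stand-ins, omits the correction for population structure and polygenic effects entirely, and runs on simulated genotypes. The function name `two_stage_lars_rf`, the parameters `n_candidates` and `n_trees`, and the simulated QTN positions and effect sizes are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lars


def two_stage_lars_rf(genotypes, phenotype, n_candidates=20, n_trees=300, seed=0):
    """Screen SNPs with LARS, then rank the survivors by random forest importance.

    Illustrative only: no population-structure or polygenic correction is done.
    Returns a list of (snp_index, importance) pairs, most important first.
    """
    # Standardize SNP columns so LARS compares them on a common scale.
    X = (genotypes - genotypes.mean(axis=0)) / (genotypes.std(axis=0) + 1e-12)
    y = phenotype - phenotype.mean()

    # Stage 1: LARS keeps at most n_candidates SNPs with nonzero coefficients.
    lars = Lars(n_nonzero_coefs=n_candidates).fit(X, y)
    candidates = np.flatnonzero(lars.coef_)

    # Stage 2: a random forest fitted on the reduced SNP set ranks the candidates.
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X[:, candidates], y)
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]
    return [(int(candidates[i]), float(importances[i])) for i in order]


# Hypothetical simulated data: 300 individuals, 500 SNPs coded 0/1/2,
# with three assumed QTNs of decreasing effect size.
rng = np.random.default_rng(42)
n_individuals, n_snps = 300, 500
genotypes = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)
true_qtns = [10, 250, 400]
effects = np.array([2.0, -1.5, 1.0])
phenotype = genotypes[:, true_qtns] @ effects + rng.normal(0.0, 0.5, n_individuals)

ranking = two_stage_lars_rf(genotypes, phenotype)
top_snps = [snp for snp, _ in ranking[:3]]
print("Top-ranked SNP indices:", top_snps)
```

Screening first keeps the forest small: the random forest only sees the handful of SNPs that LARS retains, which is one plausible reading of why the description reports lower computation time than single-stage approaches.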