Fast and Scalable Feature Selection for Gene Expression Data Using Hilbert-Schmidt Independence Criterion
Authors: | Hadi Zarkoob, Ali Ghodsi, Mehrdad J. Gangeh |
---|---|
Year of publication: | 2017 |
Subject: |
0301 basic medicine;
Proteome; Computer science; Big data; Stability (learning theory); Feature selection; 02 engineering and technology; computer.software_genre; 03 medical and health sciences; Singular value decomposition; 0202 electrical engineering, electronic engineering, information engineering; Genetics; Computer Simulation; Independence (probability theory); Oligonucleotide Array Sequence Analysis; Models, Statistical; business.industry; Microarray analysis techniques; Applied Mathematics; Gene Expression Profiling; ComputingMethodologies_PATTERNRECOGNITION; 030104 developmental biology; Kernel method; Gene Expression Regulation; Data Interpretation, Statistical; Scalability; 020201 artificial intelligence & image processing; Data mining; business; computer; Algorithms; Biotechnology; Signal Transduction |
Source: | IEEE/ACM Transactions on Computational Biology and Bioinformatics. 14(1) |
ISSN: | 1557-9964 |
Description: | Goal: In computational biology, selecting a small subset of informative genes from microarray data remains a challenge because each dataset contains thousands of genes. This paper aims to quantify the dependence between gene expression data and the response variables, and to identify a subset of the most informative genes using a fast and scalable multivariate algorithm. Methods: A novel algorithm for feature selection from gene expression data was developed. The algorithm is based on the Hilbert-Schmidt independence criterion (HSIC) and was partly motivated by singular value decomposition (SVD). Results: The algorithm is computationally fast and scalable to large datasets. Moreover, it can be applied to problems with any type of response variable, including biclass, multiclass, and continuous response variables. The performance of the proposed algorithm in terms of accuracy, stability of the selected genes, speed, and scalability was evaluated using both synthetic and real-world datasets. The simulation results demonstrated that the proposed algorithm effectively and efficiently extracted stable genes with high predictive capability, in particular for datasets with multiclass response variables. Conclusion/Significance: The proposed method does not require the whole microarray dataset to be stored in memory, and thus can easily be scaled to large datasets. This capability is an important attribute in big data analytics, where data can be large and massively distributed. |
Database: | OpenAIRE |
External link: |
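
The record above names the Hilbert-Schmidt independence criterion as the dependence measure but does not spell out the algorithm itself. As a rough illustration only, and not the authors' method, the sketch below scores each gene with the standard biased empirical HSIC estimator, trace(KHLH) / (n - 1)^2, and ranks genes by that score. The linear kernels, the function names (`linear_kernel`, `center`, `hsic`, `rank_genes`), and the toy data are all assumptions made for this example. The SVD-motivated and memory-saving parts of the published method are not reproduced here.

```python
def linear_kernel(rows):
    """Gram matrix K[i][j] = <rows[i], rows[j]> for a list of sample vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def center(K):
    """Double-center a square matrix: H K H with H = I - (1/n) * ones."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    # For symmetric matrices, trace(Kc Lc) equals the elementwise product sum.
    total = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2


def rank_genes(X, Y):
    """Score every gene (column of X) against the response Y.

    X: n samples x p genes, Y: n samples x q response columns
    (one-hot rows for class labels, a single column for a continuous response).
    Returns (gene indices sorted by descending score, list of scores).
    """
    L = linear_kernel(Y)
    scores = []
    for g in range(len(X[0])):
        gene_column = [[row[g]] for row in X]  # one feature per sample
        scores.append(hsic(linear_kernel(gene_column), L))
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order, scores


# Hypothetical toy data: 6 samples, 3 genes, 2 classes (one-hot encoded).
# Gene 0 separates the classes, gene 1 is constant, gene 2 is mostly noise.
X = [
    [1.0, 5.0, 0.3],
    [1.2, 5.0, 0.9],
    [0.9, 5.0, 0.1],
    [3.0, 5.0, 0.8],
    [3.1, 5.0, 0.2],
    [2.9, 5.0, 0.7],
]
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]

order, scores = rank_genes(X, Y)
print(order, scores)
```

Because this sketch builds full n x n kernel matrices per gene, it is meant for reading rather than for large datasets; with linear kernels the same score can be computed from centered inner products without forming the matrices.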