An Ensemble Method for Predicting Subnuclear Localizations from Primary Protein Structures

Authors: Vo Anh, Yu-Chu Tian, Guo-Sheng Han, Zu-Guo Yu, Anaththa P. D. Krishnajith
Year of publication: 2013
Subject:
Proteomics
Models, Molecular
060102 Bioinformatics
Support Vector Machine
Bioinformatics
Biochemistry
Protein sequencing
Sequence Analysis, Protein
Protein structures
Macromolecular Structure Analysis
Databases, Protein
080301 Bioinformatics Software
Physics
Multidisciplinary
Statistics
080299 Computation Theory and Mathematics not elsewhere classified
Protein Transport
Kernel (statistics)
Medicine
Sequence Analysis
Research Article
Computer Modeling
Subcellular Fractions
Protein Structure
Science
Feature vector
Feature extraction
Biophysics
Stability (learning theory)
Biostatistics
Cross-validation
Permutation
Amino Acid Sequence
Biology
Cell Nucleus
Proteins
Computational Biology
Reproducibility of Results
Pattern recognition
Subnuclear localizations
ComputingMethodologies_PATTERNRECOGNITION
ROC Curve
Computer Science
Artificial intelligence
Mathematics
Source: PLoS ONE, Vol 8, Iss 2, p e57225 (2013)
ISSN: 1932-6203
DOI: 10.1371/journal.pone.0057225
Description:
Background: Predicting protein subnuclear localization is a challenging problem. Some previous approaches based on non-sequence information, including Gene Ontology annotations and kernel fusion, have their respective limitations. The aim of this work is twofold: first, to propose a novel individual feature extraction method; second, to develop an ensemble method that improves prediction performance by using comprehensive information represented as a high-dimensional feature vector obtained from 11 feature extraction methods.
Methodology/Principal Findings: A novel two-stage multiclass support vector machine is proposed to predict protein subnuclear localizations. It considers only feature extraction methods based on amino acid classifications and physicochemical properties. To speed up the system, an automatic search method for the kernel parameter is used. The prediction performance of the method is evaluated on four datasets: the Lei dataset, a multi-localization dataset, the SNL9 dataset and a new independent dataset. In leave-one-out cross-validation, the overall prediction accuracy is 75.2% for 6 localizations on the Lei dataset and 72.1% for 9 localizations on the SNL9 dataset; it is 71.7% on the multi-localization dataset and 69.8% on the new independent dataset. Comparisons with existing methods show that the method performs better for both single-localization and multi-localization proteins and achieves more balanced sensitivities and specificities on large-size and small-size subcellular localizations. The overall accuracy improvements are 4.0% and 4.7% for single-localization proteins and 6.5% for multi-localization proteins. The reliability and stability of the classification model are further confirmed by permutation analysis.
Conclusions: The method is effective and valuable for predicting protein subnuclear localizations. A web server has been designed to implement the proposed method. It is freely available at http://bioinformatics.awowshop.com/snlpred_page.php.
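The abstract above mentions features built from amino acid classifications and evaluation by leave-one-out cross-validation. The following is a minimal sketch of those two ideas only, not the authors' implementation. The three-way residue grouping, the function names and the toy sequences are all assumptions made for illustration. A simple nearest-centroid rule stands in for the two-stage support vector machine, which the abstract does not specify in enough detail to reproduce.

```python
# Hypothetical three-way grouping of the 20 standard amino acids
# (an assumption for illustration; not taken from the paper).
AA_CLASSES = {
    "hydrophobic": set("ACFILMVW"),
    "polar": set("GHNPQSTY"),
    "charged": set("DEKR"),
}


def class_composition(sequence):
    """Return the fraction of residues in each class, in AA_CLASSES order."""
    counts = [
        sum(1 for aa in sequence.upper() if aa in members)
        for members in AA_CLASSES.values()
    ]
    total = sum(counts)  # residues outside the 20 standard letters are ignored
    return [c / total if total else 0.0 for c in counts]


def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (not an SVM): pick the label with the closest class mean."""
    groups = {}
    for vec, label in zip(train_x, train_y):
        groups.setdefault(label, []).append(vec)

    def sq_dist(label):
        vecs = groups[label]
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        return sum((a - b) ** 2 for a, b in zip(centroid, query))

    return min(groups, key=sq_dist)


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and return overall accuracy."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_centroid(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Toy, made-up sequences and labels (not from any of the four datasets).
toy_sequences = ["AVLIMAVL", "ILVAFMWC", "KRDEKRDE", "DEKKRRDE"]
toy_labels = ["loc_A", "loc_A", "loc_B", "loc_B"]
toy_features = [class_composition(s) for s in toy_sequences]
toy_accuracy = leave_one_out_accuracy(toy_features, toy_labels)
```

In a real setting the composition vector would be one of several feature blocks concatenated into a longer vector, and the stand-in classifier would be replaced by a trained multiclass SVM.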
Database: OpenAIRE