Protein classification using modified n-grams and skip-grams
Author: | Benjamin J. Heil, S M Ashiqul Islam, Erich J. Baker, Christopher M. Kearney |
---|---|
Year of publication: | 2017 |
Subject: |
0301 basic medicine; Statistics and Probability; Models, Molecular; Web server; Process (engineering); Computer science; Protein Conformation; computer.software_genre; Biochemistry; 03 medical and health sciences; 0302 clinical medicine; Sequence Analysis, Protein; Selection (linguistics); Molecular Biology; Natural Language Processing; business.industry; Proteins; Molecular Sequence Annotation; Structural Classification of Proteins database; Computer Science Applications; Computational Mathematics; Range (mathematics); ComputingMethodologies_PATTERNRECOGNITION; 030104 developmental biology; Computational Theory and Mathematics; 030220 oncology & carcinogenesis; Artificial intelligence; Supervised Machine Learning; business; computer; Natural language processing |
Source: | Bioinformatics (Oxford, England). 34(9) |
ISSN: | 1367-4811 |
Description: | Motivation: Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of the attributes used to classify the proteins. Lacking this knowledge risks selecting irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG). Results: A meta-comparison of cross-validation accuracy on twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy for m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and to make protein characteristic prediction accessible to a broader range of scientists. Availability and implementation: m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg. Supplementary information: Supplementary data are available at Bioinformatics online. |
Database: | OpenAIRE |
External link: |
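
The abstract names n-grams and skip-grams as the basis of the automated feature generation step but does not spell out how they are modified in m-NGSG. As a rough illustration only, the sketch below counts plain character n-grams and skip-grams over a protein sequence. It is not the authors' m-NGSG implementation; the function names (`ngrams`, `skipgrams`, `featurize`), the `max_n` and `max_skip` parameters, the underscore notation for skipped positions, and the example sequence are all assumptions made for this example.

```python
from collections import Counter


def ngrams(seq, n):
    """Return every contiguous substring of length n in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def skipgrams(seq, k):
    """Return residue pairs with exactly k positions skipped between them."""
    return [(seq[i], seq[i + k + 1]) for i in range(len(seq) - k - 1)]


def featurize(seq, max_n=2, max_skip=1):
    """Count n-grams up to max_n and skip-grams up to max_skip (hypothetical helper).

    Skip-grams are written with one underscore per skipped position,
    e.g. "M_V" for M and V separated by a single residue.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(seq, n))
    for k in range(1, max_skip + 1):
        counts.update(f"{a}{'_' * k}{b}" for a, b in skipgrams(seq, k))
    return counts


# Illustrative sequence fragment, not taken from the paper's datasets.
features = featurize("MKVLA")
print(sorted(features.items()))
```

Counts of this kind could then be turned into a fixed-length vector per sequence and passed to any supervised classifier; how m-NGSG modifies, weights, or selects these features is not described in the abstract.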