Protein classification using modified n-gram and skip-gram models
Authors: | S M Ashiqul Islam, Christopher M. Kearney, Erich J. Baker, Benjamin J. Heil |
---|---|
Year of publication: | 2017 |
Subject: | Process (engineering); Computer science; Structural Classification of Proteins database; Machine learning; Cross-validation; Range (mathematics); ComputingMethodologies_PATTERNRECOGNITION; n-gram; Artificial intelligence; Selection (genetic algorithm); Gram |
DOI: | 10.1101/170407 |
Description: | Motivation: Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of the attributes used to classify the proteins. Lacking this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of N-Gram and Skip-Gram models (m-NGSG). Results: A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein prediction to a broader range of scientists. Availability: m-NGSG is freely available at Bitbucket: https://bitbucket.org/smislam/mngsg/src Supplements: link to supplementary documents. Contact: Erich_Baker@baylor.edu |
Database: | OpenAIRE |
External link: |
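
The description above refers to n-gram and skip-gram features generated from a protein's primary sequence. As a rough illustration of that general idea only, the sketch below counts plain n-grams and fixed-gap skip-grams over a sequence of amino acid letters. It is not the authors' m-NGSG method: the record does not describe their modifications, and the function names (`ngrams`, `skipgrams`, `featurize`), the parameters `max_n` and `max_skip`, and the `_` gap marker are all hypothetical choices made for this example.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all contiguous substrings of length n, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def skipgrams(seq, k):
    """Return residue pairs separated by exactly k skipped positions.

    Each skipped position is written as '_' (an assumed notation),
    so 'M_V' means M, one skipped residue, then V.
    """
    return [f"{seq[i]}{'_' * k}{seq[i + k + 1]}" for i in range(len(seq) - k - 1)]


def featurize(seq, max_n=2, max_skip=1):
    """Count n-grams up to max_n and skip-grams up to max_skip gaps.

    Illustrative feature map only; not the published m-NGSG model.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(seq, n))
    for k in range(1, max_skip + 1):
        counts.update(skipgrams(seq, k))
    return counts


if __name__ == "__main__":
    # Toy sequence, not taken from any dataset in the study.
    features = featurize("MKVMK")
    for token, count in sorted(features.items()):
        print(token, count)
```

Count vectors of this kind could then be passed to any supervised classifier. How m-NGSG actually selects, modifies, and weights such features is documented in the Bitbucket repository listed in the description, not in this record.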