Popis: |
Protein function identification has become an important task because of the large number of new genomes being sequenced. Recently, distributed representations [1] of words, in the form of continuous vectors, have been found to be a very efficient way to capture semantic and syntactic information. In this representation, each word is embedded in an n-dimensional space, and similar words have nearby vectors in the embedding space. In the popular skip-gram configuration, the model uses the current word to predict its surrounding words. In this work we introduce a distributed representation for protein sequences based on reduced amino acid alphabets. In our RA2Vec (Reduced Alphabets to Vectors) implementation, we first map all Swiss-Prot sequences to reduced forms based on hydropathy and on conformational similarity. Then, using a skip-gram based method, we create reduced alphabet embedding vectors (RA2Vec) for each set. Embedding vectors were also created for the sequences in the original ProtVec representation [2]. These vectors were created for various combinations of k-gram sizes and vector sizes. All seven combinations of the original ProtVec embedding vectors, the hydropathy based embedding vectors, and the conformational similarity based embedding vectors were then used as input to Support Vector Machine classifiers, and classification models were built. The embedding vectors were further reduced using the Recursive Feature Elimination (RFE) method to maximize fivefold cross-validation (CV) accuracy. We assessed the validity and utility of the new representations on five different data sets. Our results on all data sets indicate that certain synergistic combinations of the new representations, with and without the ProtVec embedding, can result in significantly improved performance. |
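The preprocessing described in the abstract (alphabet reduction, k-gram splitting, and skip-gram training pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-group hydropathy mapping `HYDROPATHY_GROUPS` is hypothetical, the shifted non-overlapping k-gram split follows the scheme commonly associated with ProtVec, and all function names are invented for this example.

```python
from itertools import combinations

# Hypothetical hydropathy grouping, for illustration only.
# The actual reduced alphabets used by RA2Vec are not given in this text.
HYDROPATHY_GROUPS = {
    "H": "AVLIMFWC",  # treated here as hydrophobic
    "N": "GSTPYH",    # treated here as neutral
    "P": "RKDENQ",    # treated here as polar or charged
}
AA_TO_GROUP = {aa: g for g, aas in HYDROPATHY_GROUPS.items() for aa in aas}


def reduce_sequence(seq, mapping=AA_TO_GROUP, unknown="X"):
    """Map each residue to its reduced-alphabet symbol; unknown residues become `unknown`."""
    return "".join(mapping.get(aa, unknown) for aa in seq.upper())


def kgram_sentences(seq, k=3):
    """Split a sequence into k shifted lists of non-overlapping k-grams.

    Assumption: this mirrors the ProtVec-style split, where each offset
    0..k-1 yields one "sentence" of k-gram "words".
    """
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: each word paired with its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def representation_combinations(names=("ProtVec", "Hydropathy", "Conformational")):
    """Enumerate every non-empty subset of the three representations."""
    return [c for r in range(1, len(names) + 1) for c in combinations(names, r)]


if __name__ == "__main__":
    reduced = reduce_sequence("MKTAYIAK")
    for sentence in kgram_sentences(reduced, k=3):
        print(sentence, skipgram_pairs(sentence, window=1))
    print(len(representation_combinations()))
```

In a full pipeline, the k-gram sentences would be passed to a skip-gram trainer to obtain the embedding vectors. The non-empty subsets of three representations number 2^3 - 1 = 7, which matches the seven combinations mentioned in the abstract.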