Prediction of interactiveness of proteins and nucleic acids based on feature selections

Author:	Liang Liu, Youlang Yuan, Meng Xing, Yu-Dong Cai, Lei Gu, Xinlei Li, Minjie Li, Wencong Lu, Xiao-He Shi, Xiangyin Kong
Year of publication:	2009
Subject:	Molecular Sequence Data Feature selection Biology Catalysis Inorganic Chemistry chemistry.chemical_compound Protein Annotation Transcription (biology) Nucleic Acids Drug Discovery Protein Interaction Mapping Protein Interaction Domains and Motifs Amino Acid Sequence Physical and Theoretical Chemistry Molecular Biology Biological data business.industry Organic Chemistry RNA Computational Biology Proteins Pattern recognition Molecular Sequence Annotation General Medicine DNA Models Theoretical chemistry RNA splicing Nucleic acid Artificial intelligence business Algorithms Information Systems Forecasting Protein Binding
Source:	Molecular diversity. 14(4)
ISSN:	1573-501X
Description:	It is important to identify which proteins can interact with nucleic acids for the purpose of protein annotation, since interactions between nucleic acids and proteins are involved in numerous cellular processes such as replication, transcription, splicing, and DNA repair. This research tries to identify proteins that can interact with DNA, RNA, and rRNA, respectively. mRMR (Minimum redundancy and maximum relevance), with its elegant mathematical formulation, has been applied widely in processing biological data and feature analysis since its introduction in 2005. mRMR plus incremental feature selection (IFS) is known to be very efficient in feature selection and analysis, and able to improve both the effectiveness and the efficiency of a prediction model. IFS is applied to decide how many features should be selected from the feature list provided by mRMR. In the end, the features selected by mRMR and IFS are further refined by a conventional feature selection method, the forward feature wrapper (FFW), which reorders the features. Each protein is coded by 132 features, including amino acid compositions and physicochemical properties. After the feature selection, the k-Nearest Neighbor algorithm, the adopted prediction model, is trained and tested. As a result, the optimized prediction accuracies for DNA, RNA, and rRNA are 82.0%, 83.4%, and 92.3%, respectively. Furthermore, the most important features that contribute to the prediction are identified and analyzed biologically. The predictor developed for this research is available for public access at http://chemdata.shu.edu.cn/protein_na_mrmr/.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::1cc1467f35cb042cadcae6d4c44dfb96 https://pubmed.ncbi.nlm.nih.gov/19816781
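
The incremental feature selection (IFS) step named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the ranked feature list is assumed to come from a separate mRMR run, the classifier is simplified to 1-nearest-neighbor with Euclidean distance, and leave-one-out accuracy is assumed as the evaluation measure, since the record does not state these details. All function names and the toy data are hypothetical.

```python
import math


def nn_predict(train_X, train_y, x):
    """Return the label of the training sample closest to x (1-NN, Euclidean)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def loo_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy of 1-NN using only the columns in feature_idx."""
    proj = [[row[j] for j in feature_idx] for row in X]
    correct = 0
    for i in range(len(proj)):
        # Hold out sample i and classify it from all remaining samples.
        train_X = proj[:i] + proj[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nn_predict(train_X, train_y, proj[i]) == y[i]:
            correct += 1
    return correct / len(proj)


def incremental_feature_selection(X, y, ranked):
    """Score the top-1, top-2, ... prefixes of `ranked`; return the best prefix and its accuracy."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        acc = loo_accuracy(X, y, ranked[:k])
        if acc > best_acc:  # strict comparison: ties keep the smaller feature set
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]
ranked = [0, 1]  # assumed output of an mRMR ranking, most relevant first

selected, acc = incremental_feature_selection(X, y, ranked)
print(selected, acc)
```

On this toy data the noisy second feature lowers the leave-one-out accuracy, so the search stops at the single informative feature. That is the behavior IFS is meant to provide: it decides how many of the ranked features are worth keeping.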