Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct.

Authors: Funk CS; Computational Bioscience Program, University of Colorado School of Medicine, Aurora, 80045 CO USA., Kahanda I; Department of Computer Science, Colorado State University, Fort Collins, 80523 CO USA., Ben-Hur A; Department of Computer Science, Colorado State University, Fort Collins, 80523 CO USA., Verspoor KM; Department of Computing and Information Systems, University of Melbourne, Parkville, 3010 Victoria Australia; Health and Biomedical Informatics Centre, University of Melbourne, Parkville, 3010 Victoria Australia.
Language: English
Source: Journal of biomedical semantics [J Biomed Semantics] 2015 Mar 18; Vol. 6, pp. 9. Date of Electronic Publication: 2015 Mar 18 (Print Publication: 2015).
DOI: 10.1186/s13326-015-0006-4
Abstract: Most computational methods that predict protein function do not take advantage of the large amount of information contained in the biomedical literature. In this work we evaluate both ontology term co-mention and bag-of-words features mined from the biomedical literature and analyze their impact in the context of a structured output support vector machine model, GOstruct. We find that even simple literature-based features are useful for predicting human protein function (F-max: Molecular Function = 0.408, Biological Process = 0.461, Cellular Component = 0.608). One advantage of using literature features is their ability to offer easy verification of automated predictions. We find through manual inspection of misclassifications that some false positive predictions could be biologically valid predictions based upon support extracted from the literature. Additionally, we present a "medium-throughput" pipeline that was used to annotate a large subset of co-mentions; we suggest that this strategy could help to speed up the rate at which proteins are curated.
Database: MEDLINE
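
Note on the F-max values quoted in the abstract: the record does not say how F-max was computed. The sketch below assumes the protein-centric definition commonly used in function-prediction benchmarks, where precision is averaged over proteins that receive at least one prediction at a given score threshold, recall is averaged over all annotated proteins, and the reported value is the best F1 over all thresholds. The function name `f_max`, the dictionary layout, and the threshold grid are illustrative assumptions, not the authors' code.

```python
from typing import Dict, Iterable, Optional, Set


def f_max(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    thresholds: Optional[Iterable[float]] = None,
) -> float:
    """Protein-centric F-max (assumed definition, see the note above).

    predictions maps protein -> {term: confidence score in [0, 1]}.
    truth maps protein -> set of true terms.
    """
    if thresholds is None:
        # Hypothetical grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    # Only proteins with at least one true term can contribute to recall.
    annotated = {p: terms for p, terms in truth.items() if terms}
    if not annotated:
        return 0.0

    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in annotated.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= t}
            true_positives = len(predicted & true_terms)
            if predicted:
                # Precision is averaged only over proteins with predictions.
                precisions.append(true_positives / len(predicted))
            recall_sum += true_positives / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(annotated)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Under this definition a value such as 0.608 for Cellular Component would mean that, at the best single threshold, the harmonic mean of averaged precision and averaged recall is 0.608.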