Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model

Author:	ChengXiang Zhai, Xin He, Brant W. Chee, Xu Ling, Bruce R. Schatz, Moushumi Sen Sarma
Year of publication:	2010
Subject:	Computer science; Computational biology; lcsh:Computer applications to medicine. Medical informatics; Poisson distribution; computer.software_genre; Biochemistry; 03 medical and health sciences; symbols.namesake; Annotation; 0302 clinical medicine; Structural Biology; Research article; Controlled vocabulary; lcsh:QH301-705.5; Molecular Biology; Gene; 030304 developmental biology; 0303 health sciences; Models, Statistical; Gene Expression Profiling; Applied Mathematics; Computational Biology; Mixture model; Expression (mathematics); Computer Science Applications; Gene expression profiling; ComputingMethodologies_PATTERNRECOGNITION; lcsh:Biology (General); Genes; symbols; lcsh:R858-859.7; Data mining; DNA microarray; computer; 030217 neurology & neurosurgery
Source:	BMC Bioinformatics, Vol 11, Iss 1, p 272 (2010)
ISSN:	1471-2105
Description:	Background: Large-scale genomic studies often identify large gene lists, for example, sets of genes sharing the same expression pattern. These gene lists are generally interpreted by extracting the concepts overrepresented in them. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular Gene Ontology (GO). However, annotating genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered. Results: We propose a statistical method that uses the primary literature, i.e., free text, as the source for overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment, and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results. Conclusions: We conclude that the current work will provide biologists with a tool that effectively complements existing tools for overrepresentation analysis of genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::9c8168abd8c33d9c2cd713e9d7a4d7a2 https://doi.org/10.1186/1471-2105-11-272