Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model.

Authors: Jääskinen, Väinö; Parkkinen, Ville; Cheng, Lu; Corander, Jukka
Subject:
Source: Statistical Applications in Genetics & Molecular Biology; Jan 2014, Vol. 13, Issue 1, pp. 105-121, 17 pp., 6 charts
Abstract: In many biological applications it is necessary to cluster DNA sequences into groups that represent underlying organismal units, such as named species or genera. In metagenomics this grouping typically has to be achieved on the basis of relatively short sequences that contain different types of errors, which makes a statistical modeling approach desirable. Here we introduce a novel method for this purpose by developing a stochastic partition model that clusters Markov chains of a given order. The model is based on a Dirichlet process prior, and we use conjugate priors for the Markov chain parameters, which yields an analytical expression for comparing the marginal likelihoods of any two partitions. To find a good candidate for the posterior mode in the partition space, we use a hybrid computational approach that combines the EM algorithm with a greedy search. This approach is demonstrated to be faster and to yield highly accurate results compared to earlier suggested clustering methods for the metagenomics application. Our model is fairly generic and could also be used for clustering other types of sequence data for which Markov chains provide a reasonable way to compress information, as illustrated by experiments on shotgun-sequence-type data from an Escherichia coli strain. [ABSTRACT FROM AUTHOR]
Database: Complementary Index