Phylogenetic mixture models for proteins

Authors:	Si Quang Le, Nicolas Lartillot, Olivier Gascuel
Contributors:	Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM); Méthodes et Algorithmes pour la Bioinformatique (MAB), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM)
Publication year:	2008
Subject:	0106 biological sciences; JTT; Biology; Bioinformatics; 010603 evolutionary biology; 01 natural sciences; General Biochemistry, Genetics and Molecular Biology; Evolution, Molecular; Matrix (chemical analysis); 03 medical and health sciences; Phylogenetics; phylogenetic inference; Single amino acid; Phylogeny; 030304 developmental biology; chemistry.chemical_classification; Likelihood Functions; 0303 health sciences; maximum-likelihood estimations; Models, Genetic; Phylogenetic tree; Substitution (logic); Proteins; Amino acid substitution; CAT profile model; Mixture model; [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM]; Amino acid; amino acid replacement matrices; Amino Acid Substitution; chemistry; Biochemistry; [INFO.INFO-BI]Computer Science [cs]/Bioinformatics [q-bio.QM]; General Agricultural and Biological Sciences; WAG and LG; Research Article
Source:	Philosophical Transactions of the Royal Society B: Biological Sciences, The Royal Society, 2008, 363, pp. 3965-3976. ⟨10.1098/rstb.2008.0180⟩
ISSN:	1471-2970 0962-8436
DOI:	10.1098/rstb.2008.0180
Description:	Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors, including the genetic code, solvent exposure, secondary and tertiary structure, and protein function. These factors affect the substitution pattern and, in most cases, a single replacement matrix is not enough to represent the full complexity of the evolutionary process. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test their performance using independent alignments from TreeBase. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where estimation uses the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TreeBase test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307–1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for only 1 alignment. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures affects not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
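The two quantities the description relies on, the mixture likelihood of a site and the per-site AIC gain, can be illustrated with a minimal Python sketch. It assumes the standard definitions (a site's likelihood is the weighted sum of its likelihoods under each component matrix, and AIC = 2k - 2 ln L); the function names, the inputs and the way free parameters are counted are hypothetical and are not taken from the paper or from PhyML.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture of matrices.

    site_likelihoods[i][m] is the likelihood of site i under component m
    (assumed to be precomputed on a fixed tree); weights[m] is the mixture
    weight of component m, and the weights are assumed to sum to 1.
    """
    total = 0.0
    for per_component in site_likelihoods:
        # A site's likelihood is the weighted average over components.
        site_lik = sum(w * lik for w, lik in zip(weights, per_component))
        total += math.log(site_lik)
    return total


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def aic_gain_per_site(lnl_single, k_single, lnl_mixture, k_mixture, n_sites):
    """Positive values mean the mixture has the lower (better) AIC."""
    return (aic(lnl_single, k_single) - aic(lnl_mixture, k_mixture)) / n_sites


# Illustrative numbers only, not results from the paper.
toy_sites = [[0.2, 0.4], [0.1, 0.3]]
toy_lnl = mixture_log_likelihood(toy_sites, [0.5, 0.5])
toy_gain = aic_gain_per_site(-10000.0, 1, -9900.0, 5, 400)
```

Dividing by the number of sites makes gains comparable across alignments of different lengths, which is how the averages over test alignments quoted above should be read.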
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::9eda04e0a426c432cb268437a7a9f206 https://doi.org/10.1098/rstb.2008.0180