Scalable Empirical Mixture Models That Account for Across-Site Compositional Heterogeneity
Authors: | Gergely J. Szöllősi, Nicolas Lartillot, Dominik Schrempf |
---|---|
Contributors: | Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS) |
Language: | English |
Publication year: | 2020 |
Subject: |
0106 biological sciences; Computer science; Process (engineering); Biology; [SDV.BID.SPT]Life Sciences [q-bio]/Biodiversity/Systematics, Phylogenetics and taxonomy; AcademicSubjects/SCI01180; 010603 evolutionary biology; 01 natural sciences; 03 medical and health sciences; Software; [SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN]; empirical distribution mixture models; Genetics; Cluster (physics); Range (statistics); Methods; Cluster Analysis; Amino acid replacement; Molecular Biology; Ecology, Evolution, Behavior and Systematics; Phylogeny; 030304 developmental biology; chemistry.chemical_classification; Long branch attraction; [STAT.AP]Statistics [stat]/Applications [stat.AP]; 0303 health sciences; long-branch attraction; Phylogenetic tree; Models, Genetic; business.industry; [SDV.BID.EVO]Life Sciences [q-bio]/Biodiversity/Populations and Evolution [q-bio.PE]; AcademicSubjects/SCI01130; Mixture model; empirical profile mixture models; Empirical distribution function; Amino acid; phylogenetics; chemistry; Amino Acid Substitution; Genetic Techniques; Scalability; microsporidia; Biological system; business; [STAT.ME]Statistics [stat]/Methodology [stat.ME] |
Source: | Molecular Biology and Evolution, 2020, 37, pp. 3616-3631. ⟨10.1093/molbev/msaa145⟩ |
ISSN: | 1537-1719 0737-4038 |
DOI: | 10.1093/molbev/msaa145 |
Description: | Biochemical demands constrain the range of amino acids acceptable at specific sites, resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long-branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10 to C60 models). Here, we present a new, scalable method, EDCluster, for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations which allow the detection of specialized amino acid distributions either from curated databases or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4096 components. Detailed analyses of the UDM models demonstrate the removal of various long-branch attraction artifacts and improved performance compared to the C10 to C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes). |
Database: | OpenAIRE |
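
The description states that EDCluster derives mixture components from a cluster analysis performed after a coordinate transformation of amino acid distributions. The following is a minimal sketch of that general idea, not the authors' implementation. The centered log-ratio transform, the plain k-means loop, the pseudocount, and every function name below are assumptions chosen for illustration, since the record does not specify which transformations or clustering algorithm EDCluster uses.

```python
import numpy as np


def clr(profiles, pseudocount=1e-6):
    """Centered log-ratio transform of per-site amino acid frequencies.

    Assumption: this is one possible coordinate transformation; the
    record does not say which transformations EDCluster applies.
    """
    p = profiles + pseudocount                # avoid log(0)
    p = p / p.sum(axis=1, keepdims=True)      # renormalise each row
    logp = np.log(p)
    return logp - logp.mean(axis=1, keepdims=True)


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns a cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:              # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


def mixture_from_profiles(profiles, k, seed=0):
    """Cluster site profiles and summarise each cluster as one component.

    Returns (weights, components): weights are cluster proportions, and
    each component is the mean of the untransformed profiles in a cluster.
    """
    labels = kmeans(clr(profiles), k, seed=seed)
    used = np.unique(labels)                  # drop clusters that ended up empty
    weights = np.array([(labels == j).mean() for j in used])
    components = np.array([profiles[labels == j].mean(axis=0) for j in used])
    return weights, components


if __name__ == "__main__":
    # Toy input: 500 synthetic site profiles over 20 amino acids.
    rng = np.random.default_rng(1)
    profiles = rng.dirichlet(np.full(20, 0.5), size=500)
    weights, components = mixture_from_profiles(profiles, k=4)
    print(weights)
    print(components.round(3))
```

Averaging the untransformed profiles within each cluster keeps every component a valid probability distribution over the 20 amino acids, and the cluster proportions serve as mixture weights. A real analysis would take the site profiles from a curated database or from the alignment at hand, as the description notes, rather than from synthetic draws.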