Scalable Empirical Mixture Models That Account for Across-Site Compositional Heterogeneity

Authors: Gergely J. Szöllősi, Nicolas Lartillot, Dominik Schrempf
Contributors: Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)
Language: English
Year of publication: 2020
Subject:
0106 biological sciences
Computer science
Process (engineering)
Biology
[SDV.BID.SPT]Life Sciences [q-bio]/Biodiversity/Systematics, Phylogenetics and taxonomy
AcademicSubjects/SCI01180
010603 evolutionary biology
01 natural sciences
03 medical and health sciences
Software
[SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN]
empirical distribution mixture models
Genetics
Cluster (physics)
Range (statistics)
Methods
Cluster Analysis
Amino acid replacement
Molecular Biology
Ecology, Evolution, Behavior and Systematics
Phylogeny
030304 developmental biology
chemistry.chemical_classification
Long branch attraction
[STAT.AP]Statistics [stat]/Applications [stat.AP]
0303 health sciences
long-branch attraction
Phylogenetic tree
Models, Genetic
business.industry
[SDV.BID.EVO]Life Sciences [q-bio]/Biodiversity/Populations and Evolution [q-bio.PE]
AcademicSubjects/SCI01130
Mixture model
empirical profile mixture models
Empirical distribution function
Amino acid
phylogenetics
chemistry
Amino Acid Substitution
Genetic Techniques
Scalability
microsporidia
Biological system
business
[STAT.ME]Statistics [stat]/Methodology [stat.ME]
Source: Molecular Biology and Evolution
Molecular Biology and Evolution, 2020, 37, pp.3616-3631. ⟨10.1093/molbev/msaa145⟩
ISSN: 1537-1719
0737-4038
DOI: 10.1093/molbev/msaa145
Description: Biochemical demands constrain the range of amino acids acceptable at specific sites, resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long-branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10 to C60 models). Here, we present a new, scalable method, EDCluster, for finding empirical distribution mixture models using a simple cluster analysis. The cluster analysis uses specific coordinate transformations that allow the detection of specialized amino acid distributions either from curated databases or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases to provide universal distribution mixture (UDM) models comprising up to 4096 components. Detailed analyses of the UDM models demonstrate the removal of various long-branch attraction artifacts and improved performance compared to the C10 to C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes).
Database: OpenAIRE
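
The general idea in the description (transform per-site amino acid distributions into new coordinates, cluster them, and read mixture components off the clusters) can be illustrated with a minimal sketch. This is not the EDCluster implementation. The centered log-ratio transform, the plain k-means with farthest-first initialisation, the pseudocount, and all function names (`clr`, `inverse_clr`, `kmeans`, `mixture_from_profiles`) are assumptions chosen for illustration, and the toy profiles use four hypothetical states rather than twenty amino acids.

```python
import math


def clr(profile, pseudocount=1e-6):
    """Centered log-ratio transform of a frequency vector (assumed transform)."""
    smoothed = [p + pseudocount for p in profile]
    total = sum(smoothed)
    logs = [math.log(p / total) for p in smoothed]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def inverse_clr(coords):
    """Map centered log-ratio coordinates back to a probability vector."""
    exps = [math.exp(x) for x in coords]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain k-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point farthest from all centers chosen so far.
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


def mixture_from_profiles(profiles, k):
    """Return (weights, components) of a k-component distribution mixture."""
    transformed = [clr(p) for p in profiles]
    centers, assignment = kmeans(transformed, k)
    # Component weight: fraction of site profiles assigned to the cluster.
    weights = [assignment.count(j) / len(profiles) for j in range(k)]
    # Component distribution: cluster center mapped back to frequencies.
    components = [inverse_clr(c) for c in centers]
    return weights, components


# Toy site profiles over four hypothetical states (not real data).
profiles = [
    [0.70, 0.10, 0.10, 0.10],
    [0.68, 0.12, 0.10, 0.10],
    [0.72, 0.08, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.12, 0.08, 0.70],
    [0.10, 0.10, 0.12, 0.68],
]
weights, components = mixture_from_profiles(profiles, k=2)
```

In this sketch each cluster becomes one mixture component: its weight is the share of site profiles it attracts, and its stationary distribution is the cluster center mapped back to frequencies. Clustering in log-ratio coordinates rather than on raw frequencies is one way to keep rare states from being swamped by dominant ones; whether it matches the transformations used in the paper is not stated in this record.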