A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices

Authors:	Aimee R. Taylor, Arjen M. Dondorp, James A. Watson, Christopher Holmes, Caroline O. Buckee, Nicholas J. White, Elizabeth A. Ashley
Contributors:	Intensive Care Medicine
Language:	English
Publication year:	2020
Subject:	Plasmodium Cancer Research Epidemiology Computer science Population structure Drug Resistance Population genetics QH426-470 computer.software_genre Biochemistry Genome Machine Learning Medical Conditions 0302 clinical medicine Medicine and Health Sciences Feature (machine learning) Cluster Analysis Parasite hosting Malaria Falciparum Genetics (clinical) Protozoans Molecular Epidemiology 0303 health sciences biology Applied Mathematics Simulation and Modeling Malarial Parasites Eukaryota 3. Good health Nucleic acids Genetic Epidemiology Physical Sciences Unsupervised learning Cambodia Malaria control Algorithm Algorithms Research Article Computer and Information Sciences Genotype DNA recombination Plasmodium falciparum Research and Analysis Methods Machine learning Machine Learning Algorithms Antimalarials 03 medical and health sciences Data visualization Artificial Intelligence Parasite Groups parasitic diseases Parasitic Diseases Genetics medicine Humans Molecular Biology Ecology Evolution Behavior and Systematics 030304 developmental biology Evolutionary Biology Population Biology business.industry Organisms Biology and Life Sciences Statistical model DNA Tropical Diseases biology.organism_classification medicine.disease Parasitic Protozoans Malaria Genetics Population Genetic distance Genetic epidemiology Parasitology Artificial intelligence business Apicomplexa computer Mathematics Population Genetics 030217 neurology & neurosurgery Unsupervised Machine Learning
Source:	PLoS Genetics, Vol 16, Iss 10, p e1009037 (2020). Public Library of Science.
ISSN:	1553-7390
Description:	Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models, speculative reasoning should feature only as discussion, not as results.
Author summary: Genetic epidemiology studies of malaria attempt to characterise what is happening in malaria parasite populations. In particular, they are an important tool to track the spread of drug resistance and to validate genetic markers of drug resistance. To make sense of parasite genetic data, researchers usually characterise the population structure using statistical methods. This is most often done as a two-step process. The first step is data reduction, whereby the data are summarised into a distance matrix (each entry represents the genetic distance between two isolates). In the second step, the distance matrix is input into an unsupervised machine learning algorithm. Principal coordinates analysis and hierarchical agglomerative clustering are the two most popular unsupervised machine learning algorithms used for this purpose in malaria genetic epidemiology. We highlight that this procedure is sensitive to the choice of genetic distance and to the specification of the algorithms. These unsupervised methods are useful for exploratory data analysis but cannot be used to infer historical events. We provide some guidance on how to make genetic epidemiology analyses more transparent and reproducible.
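The two-step procedure described above (distance matrix, then PCoA or HAC) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the genotypes are simulated toy data, the helper names simulate_genotypes and pcoa are hypothetical, and the two distance definitions (Hamming as a stand-in for one minus identity-by-state, and Jaccard as an arbitrary alternative) are chosen only to show that cluster assignments may change with the distance and the linkage rule.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def simulate_genotypes(n_per_group=20, n_snps=200, seed=0):
    """Hypothetical toy data: two groups of haploid biallelic genotypes."""
    rng = np.random.default_rng(seed)
    freqs_a = rng.uniform(0.1, 0.9, n_snps)
    # The second group has perturbed allele frequencies (an assumption).
    freqs_b = np.clip(freqs_a + rng.normal(0, 0.3, n_snps), 0.05, 0.95)
    group_a = rng.random((n_per_group, n_snps)) < freqs_a
    group_b = rng.random((n_per_group, n_snps)) < freqs_b
    return np.vstack([group_a, group_b]).astype(int)


def pcoa(dist, n_axes=2):
    """Classical PCoA: double-centre squared distances, then eigendecompose."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Non-Euclidean distances can yield negative eigenvalues; clip them to 0.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))


genotypes = simulate_genotypes()

# Step 1: data reduction into a distance matrix (two candidate definitions).
distances = {
    "hamming": squareform(pdist(genotypes, metric="hamming")),
    "jaccard": squareform(pdist(genotypes.astype(bool), metric="jaccard")),
}

# Step 2: feed each matrix to PCoA and to HAC under several linkage rules.
coords_by_distance = {}
labels_by_setting = {}
for dist_name, dist in distances.items():
    coords_by_distance[dist_name] = pcoa(dist)
    condensed = squareform(dist, checks=False)
    for method in ("single", "average", "complete"):
        tree = linkage(condensed, method=method)
        labels = fcluster(tree, t=2, criterion="maxclust")
        labels_by_setting[(dist_name, method)] = labels
        sizes = np.bincount(labels)[1:]
        print(f"{dist_name:8s} {method:9s} cluster sizes: {sizes.tolist()}")
```

Comparing the printed cluster sizes across the six settings is a small-scale version of the sensitivity analysis the guidance calls for; none of the outputs says anything about ancestry.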
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::7da8213ab0575f72b258a0a72f0737ad http://www.scopus.com/inward/record.url?scp=85092928737&partnerID=8YFLogxK