Popis: |
Lateral genetic transfer (LGT) refers to several processes by which microbes can take up genetic material from other organisms, maintain it and often integrate it into their own genomes. It is widely accepted that LGT plays an important role in the evolution of microbial genomes, and in the ability of these organisms to adapt to and exploit new ecological opportunities. Computational methods have been applied to detect LGT since the 1990s. Most classical approaches to inferring LGT follow the same steps: delineating sets of orthologous sequences, multiple sequence alignment, phylogenetic inference, and then finding incongruities between the topology of the inferred tree and that of a reference tree. Most of these steps are computationally hard, so these methods do not scale to very large datasets. With the ongoing development of new sequencing technologies, more and more sequences are becoming available for study, necessitating new methods to detect LGT on large datasets. Once lateral events have been inferred, we can generate LGT networks in which nodes represent DNA carriers such as genomes or plasmids, and edges represent LGT events. By analysing these networks, we can delineate genetic exchange communities (GECs), groups of organisms that have transferred genetic material amongst themselves, and study their properties. This thesis has three aims: 1) to design and implement an efficient and effective method to detect LGT that can also identify the direction of transfer; 2) to apply this new method to empirical datasets to evaluate its performance, and to build LGT networks from the detected events; and 3) to analyse the LGT networks and identify genetic exchange communities. In Chapter 2, we develop an alignment-free method to detect LGT, based on term frequency-inverse document frequency (TF-IDF). TF-IDF is a concept from text mining, originally used to find the key words in a document.
We treat genomes as documents and use k-mers (fixed-length substrings) to represent words. The genomes are arranged into groups, usually according to recognised biological relationships. If, in a sequence, we find a series of k-mers (separated from each other by a gap of no more than size G) that are infrequent within its own group but frequent in a different group, then the segment they span is judged to be lateral, with the direction of transfer from the latter (donor) group into that (recipient) sequence. We test this method on simulated datasets, varying k, G and the rates of nucleotide replacement within-group, between-group and post-LGT. We find that in many biologically relevant cases the method performs effectively (precision and recall above 85%); it performs better if k is between 25 and 45, between-group distance is large, and within-group distance is small. We also compare our TF-IDF method with ALFY, another alignment-free method for LGT detection, on both simulated and empirical datasets (seven Staphylococcus aureus genomes). On the simulated datasets, TF-IDF exhibits slightly lower recall but much greater precision than ALFY. On the empirical dataset, TF-IDF finds all LGT events inferred by ALFY, as well as other regions of interest, including likely lateral regions containing antibiotic-resistance genes. TF-IDF runs much faster than ALFY on large datasets, but the current implementation can be memory-intensive. These results establish TF-IDF as a competitive method for inferring LGT. In Chapter 3, we apply TF-IDF to three empirical datasets (genomes of 27 Escherichia coli and Shigella; 110 enteric bacteria; and 143 bacteria and archaea) to investigate its performance on datasets of different evolutionary breadth. We study the dependence of the method on k and G, and identify optimal parameters for a range of realistic scenarios.
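The detection rule described for Chapter 2 (k-mers that are rare in a sequence's own group but common in another group, chained across gaps of at most G) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the thresholds `max_own_frac` and `min_donor_frac`, and the use of simple presence fractions in place of full TF-IDF weights are all assumptions made for illustration, as is the choice to measure a gap from the end of one k-mer to the start of the next.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_containing(kmer, genome_sets):
    """Fraction of genomes (given as k-mer sets) that contain kmer."""
    if not genome_sets:
        return 0.0
    return sum(kmer in s for s in genome_sets) / len(genome_sets)


def lateral_segments(query, own_others, other_groups, k, max_gap,
                     max_own_frac=0.0, min_donor_frac=0.5):
    """Return (start, end, donor_group) segments of query judged lateral.

    own_others   : other genomes in the query's own group (hypothetical input)
    other_groups : dict mapping group name -> list of genomes
    max_gap      : plays the role of G; thresholds are illustrative assumptions
    """
    own_sets = [kmer_set(g, k) for g in own_others]
    donor_sets = {name: [kmer_set(g, k) for g in genomes]
                  for name, genomes in other_groups.items()}
    segments = []  # each entry is [start, end, donor]
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        # Skip k-mers that are frequent in the query's own group.
        if fraction_containing(kmer, own_sets) > max_own_frac:
            continue
        # Pick the group in which this k-mer is most frequent.
        best_name, best_frac = None, 0.0
        for name, sets in donor_sets.items():
            frac = fraction_containing(kmer, sets)
            if frac > best_frac:
                best_name, best_frac = name, frac
        if best_name is None or best_frac < min_donor_frac:
            continue
        # Extend the previous segment if same donor and gap is small enough.
        last = segments[-1] if segments else None
        if last and last[2] == best_name and i - last[1] <= max_gap:
            last[1] = i + k
        else:
            segments.append([i, i + k, best_name])
    return [tuple(s) for s in segments]


# Toy example: an 8-base insert from hypothetical group "B" inside a poly-A query.
own_others = ["A" * 24]
other_groups = {"B": ["TTGCCGTGCGTT"]}
query = "A" * 8 + "GCCGTGCG" + "A" * 8
print(lateral_segments(query, own_others, other_groups, k=4, max_gap=2))
```

The toy sequences are far shorter than real genomes, and k=4 is used only to keep the example readable; the abstract reports best performance for k between 25 and 45.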
We observe an abundance of lateral transfers among groups of Escherichia coli and Shigella, and find indications of more-ancient transfers, which are otherwise difficult to detect. In the enteric bacteria dataset, most of the LGT signal comes from exchanges between E. coli and Shigella, but we can nonetheless recognise a lower rate of LGT with the other groups (except Yersinia). As expected, few LGT events can be inferred between different phyla in the prokaryote dataset. We map these lateral regions to genes, and use enrichment tests to determine which biological process annotations are over- or under-represented among these lateral genes. In LGT networks, regions in which most nodes are interconnected represent potential biological communities that exchange genetic material. In Chapter 4 we define cliques in LGT networks as genetic exchange communities (GECs). We are interested in the taxonomic and physiological nature of these GECs, and in whether their members share common environments. Finding cliques (or near-cliques) in networks is an NP-hard problem; however, there exist several good heuristic methods for this, many of which are implemented in the software package GrAPPA. In this chapter we use GrAPPA to identify GECs in the datasets we studied in Chapter 3. By varying the parameter values of TF-IDF, we can identify phyla or classes that persist as members of GECs, and those that are more transient in this sense. We then apply enrichment tests to identify the biological processes that underlie these GECs. Overall, this project has introduced new capabilities, generated new insights and opened new perspectives on LGT among bacteria and archaea. Using the TF-IDF method we can detect LGT in large genome-scale datasets, and for the first time systematically infer the directionality of transfer.
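The clique-based definition of GECs used in Chapter 4 can be illustrated with a small sketch. It does not use GrAPPA or assume anything about its interface; instead it enumerates maximal cliques exactly with the basic Bron-Kerbosch recursion, which is practical only for small networks. The function names, the `min_size` cut-off and the node labels `g1` to `g4` are hypothetical, and edge direction (donor to recipient) is ignored here as a simplifying assumption.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (donor, recipient) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def maximal_cliques(adj):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def genetic_exchange_communities(edges, min_size=3):
    """Treat maximal cliques with at least min_size nodes as candidate GECs."""
    adj = build_adjacency(edges)
    return [c for c in maximal_cliques(adj) if len(c) >= min_size]


# Hypothetical LGT network: g1, g2 and g3 all exchange with each other.
edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")]
for gec in genetic_exchange_communities(edges):
    print(sorted(gec))
```

Exact enumeration is exponential in the worst case, which is why heuristic methods are preferred for larger LGT networks.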
The concept of GEC sheds new light on the processes behind lateral transfer, and will allow researchers to better understand the mechanisms and conditions behind LGT. |