Evolutionary computing strategies for the detection of conserved patterns in genomic DNA

Author: Beiko, Robert G
Language: English
Year of publication: 2003
Subject:
Document type: Thesis
DOI: 10.20381/ruor-19547
Description: The detection of regulatory sequences in DNA is a challenging problem, especially when considered in the context of whole genomes. The degree of sequence conservation of regulatory protein binding sites is often weak, and the sites are obscured by surrounding intergenic sequence. Since structural interactions are vital for protein-DNA interactions, structural representations of regulatory sites can yield a more accurate model and a better understanding of within-site variability. However, the use of multiple alternative representations of DNA introduces a requirement for novel algorithms that can create and test different combinations of DNA features. The Genetic Algorithm Neural Network (GANN) was designed to identify combinations of patterns that can be used to distinguish between different classes of training sequences. GANN trains a set of artificial neural networks to classify sets of sequences using either backpropagation or a genetic algorithm, and uses an 'outer genetic algorithm' to choose the best inputs from a pool of DNA features that can include sequence, structure, and weight matrix representations. When trained with a subset of upstream sequences from a whole genome, GANN was able to detect patterns such as the Shine-Dalgarno sequence in Escherichia coli K12, and sequences consistent with archaeal promoters in the archaeon Sulfolobus solfataricus P2. The Motif Genetic Algorithm (MGA) constructs motif representations by concatenating minimal units of DNA sequence and structure. This algorithm was used to model conserved patterns in DNA, including the binding sites for E. coli cyclic AMP activated protein (CAP), integration host factor (IHF), and two different promoter types recognized by alternative bacterial sigma factors. The CAP models were used to detect other putative binding sites in upstream regions of the E. coli K12 genome, while attempts to train an accurate model of IHF binding sites revealed an important role for structural representations in motif modeling.
Database: Networked Digital Library of Theses & Dissertations