Mining repetitive sequences using a big data ecosystem

Authors: Michael A. Phinney, Hongfei Cao, Chi-Ren Shyu, Andi Dhroso
Year of publication: 2013
Subject:
Source: BIBM
DOI: 10.1109/bibm.2013.6732763
Description: Identifying repetitive gene sequences within DNA sequences that span a collection of species is a problem that is conceptually simple yet computationally challenging. Biological research suggests that certain regions within genomic sequences may have remained unchanged for hundreds of millions of years; understanding and identifying these highly preserved regions is a major challenge faced by bioinformaticians. From an evolutionary perspective on DNA, pinpointing these repetitive sequences is the first step toward understanding functional similarities and diversities. The difficulty of this problem arises from the volume of data required for analysis, which grows with every genome that is sequenced. Traditional approaches to identifying repetitive sequences often require pair-wise comparison of chromosomes, which takes a significant amount of time to produce results. When comparing n chromosomes, n(n-1) individual comparisons must be made. To avoid exhaustive pair-wise comparisons, we designed an algorithm that partitions genomic sequences into search key values representing potential repetitive sequences, which are hashed into bins. When new genomes are introduced, we process only the new sequences and aggregate the new results with those previously processed.
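The binning idea in the description can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a search key is simply a fixed-length substring (a k-mer), that a Python dictionary stands in for the hashed bins, and that a key counts as a potential repeat once it appears in at least two sequences. The names `search_keys`, `RepeatIndex`, and the toy sequences are hypothetical.

```python
from collections import defaultdict


def search_keys(sequence, k):
    """Yield every length-k substring (an assumed form of search key)."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class RepeatIndex:
    """Toy index: each bin maps a search key to the sequences containing it."""

    def __init__(self, k):
        self.k = k
        self.bins = defaultdict(set)  # search key -> set of sequence names

    def add(self, name, sequence):
        """Process one new sequence; earlier sequences are not revisited."""
        for key in search_keys(sequence.upper(), self.k):
            self.bins[key].add(name)

    def shared(self, min_sequences=2):
        """Return the keys found in at least min_sequences sequences."""
        return {
            key: sorted(names)
            for key, names in self.bins.items()
            if len(names) >= min_sequences
        }


# Hypothetical toy data, not real chromosomes.
index = RepeatIndex(k=4)
index.add("chrA", "ACGTACGGA")
index.add("chrB", "TTACGTAA")
before = index.shared()

# A new sequence arrives: only it is processed, and the bins aggregate the result.
index.add("chrC", "GGACGTA")
after = index.shared()
```

Because each sequence is scanned once and only touches its own keys, adding a sequence costs time proportional to its length rather than requiring a comparison against every sequence already indexed.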
Database: OpenAIRE