Burrows Wheeler Transform and Wavelet Tree Based Retrieval of Genome Sequence in an Indexed Genome Database

Autor:	Ranvijay, Agarwal Suneeta, Kumar Sanjeev
Rok vydání:	2021
Předmět:	Whole genome sequencing General Computer Science Burrows–Wheeler transform Computer science business.industry Wavelet Tree Genome database Pattern recognition Artificial intelligence business
Zdroj:	Recent Advances in Computer Science and Communications. 13:1213-1220
ISSN:	2666-2558
DOI:	10.2174/2213275912666190313162132
Popis:	Background: New generation sequencing machinery such as Illumina and Solexa can generate millions of reads from given genome sequence on a single run. There is a need for suitable data structure, efficient with respect to memory as well as time to align these enormous reads into reference genome. There are a number of existing techniques of indexing and reads alignment, such as MAQ, Bowtie, BWA, BWBBLE and Kart. Memory efficient versions of these techniques are 10- 20% slower than their respective normal versions. Objective: A new approach for efficient indexing and retrieval of large genomic data. Methods: In this paper, we propose an efficient method based on Burrows Wheeler Transform and Wavelet Tree (BWIT) for genome sequence indexing and reads alignment. Both types of alignments (exact and approximate) are possible by the proposed approach (BWIT). Results: The performance of BWIT is experimentally found to be better than existing ones with respect to both memory and speed. Experimental work shows that proposed approach performs best in case of protein sequence indexing. All the existing read alignment approaches depend upon the size of index used. In general, time required increases with reduction in index size used. Experiments have been done with Bowtie, BWA & Kart by taking index size as 1.25N, 1.05N, .98N, where N is the size of the text (reference genome). In our approach BWIT index size is .6N which is lesser than index size used in all other approaches. It is observed that even using smallest index size alignment time in our approach is least. Conclusion: An innovative indexing technique is presented to address the problem of storage, transmission and retrieval of large DNA/Protein Sequence data.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_________::044c472b53616be1711cf2f9c0abe225 https://doi.org/10.2174/2213275912666190313162132 Zobrazit plný text záznamu