conLSH: Context based Locality Sensitive Hashing for Mapping of noisy SMRT Reads

Autor:	Angana Chakraborty, Sanghamitra Bandyopadhyay
Jazyk:	angličtina
Rok vydání:	2019
Předmět:	0301 basic medicine FOS: Computer and information sciences Computer Science - Machine Learning Time Factors Closeness Machine Learning (stat.ML) Context based Biochemistry Locality-sensitive hashing Machine Learning (cs.LG) 03 medical and health sciences 0302 clinical medicine Structural Biology Statistics - Machine Learning Probability of error Computer Science - Data Structures and Algorithms Quantitative Biology - Genomics Data Structures and Algorithms (cs.DS) Genomics (q-bio.GN) business.industry Organic Chemistry Search engine indexing Computational Biology Pattern recognition Sequence Analysis DNA Computational Mathematics 030104 developmental biology 030220 oncology & carcinogenesis Bounded function FOS: Biological sciences Artificial intelligence business Algorithm Algorithms Software Reference genome
DOI:	10.1101/574467
Popis:	Single Molecule Real-Time (SMRT) sequencing is a recent advancement of Next Gen technology developed by Pacific Bio (PacBio). It comes with an explosion of long and noisy reads demanding cutting edge research to get most out of it. To deal with the high error probability of SMRT data, a novel contextual Locality Sensitive Hashing (conLSH) based algorithm is proposed in this article, which can effectively align the noisy SMRT reads to the reference genome. Here, sequences are hashed together based not only on their closeness, but also on similarity of context. The algorithm has $\mathcal{O}(n^{\rho+1})$ space requirement, where $n$ is the number of sequences in the corpus and $\rho$ is a constant. The indexing time and querying time are bounded by $\mathcal{O}( \frac{n^{\rho+1} \cdot \ln n}{\ln \frac{1}{P_2}})$ and $\mathcal{O}(n^\rho)$ respectively, where $P_2 > 0$, is a probability value. This algorithm is particularly useful for retrieving similar sequences, a widely used task in biology. The proposed conLSH based aligner is compared with rHAT, popularly used for aligning SMRT reads, and is found to comprehensively beat it in speed as well as in memory requirements. In particular, it takes approximately $24.2\%$ less processing time, while saving about $70.3\%$ in peak memory requirement for H.sapiens PacBio dataset. Comment: arXiv admin note: text overlap with arXiv:1705.03933
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::eae390ff1214a436e5a297244933738b Zobrazit plný text záznamu