Linear time minimum segmentation enables scalable founder reconstruction

Autor: Bastien Cazaux, Tuukka Norri, Dmitry Kosolobov, Veli Mäkinen
Přispěvatelé: Genome-scale Algorithmics research group / Veli Mäkinen, Department of Computer Science, Bioinformatics, Algorithmic Bioinformatics
Rok vydání: 2019
Předmět:
lcsh:QH426-470
Burrows–Wheeler transform
0206 medical engineering
02 engineering and technology
Disjoint sets
FOUNDER RECONSTRUCTION
PAN-GENOME INDEXING
Dynamic programming
Range minimum query
Combinatorics
Set (abstract data type)
BURROWS-WHEELER TRANSFORM
Structural Biology
RANGE MINIMUM QUERY
Positional Burrows–Wheeler transform
Partition (number theory)
POSITIONAL BURROWS-WHEELER TRANSFORM
Segmentation
lcsh:QH301-705.5
Molecular Biology
Time complexity
Pan-genome indexing
Mathematics
Research
Applied Mathematics
Positional Burrows-Wheeler transform
113 Computer and information sciences
Substring
lcsh:Genetics
DYNAMIC PROGRAMMING
lcsh:Biology (General)
Computational Theory and Mathematics
QUERIES
1182 Biochemistry
cell and molecular biology

Founder reconstruction
020602 bioinformatics
STORAGE
Zdroj: Algorithms for Molecular Biology, Vol 14, Iss 1, Pp 1-15 (2019)
Algorithms for Molecular Biology : AMB
Algorithms for Molecular Biology
ISSN: 1748-7188
DOI: 10.1186/s13015-019-0147-6
Popis: Background We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {R}} = \{R_1, \ldots , R_m\}$$\end{document}R={R1,…,Rm} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into set P of disjoint segments such that each segment \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[a,b] \in P$$\end{document}[a,b]∈P has length at least L and the number \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$d(a,b)=|\{R_i[a,b] :1\le i \le m\}|$$\end{document}d(a,b)=|{Ri[a,b]:1≤i≤m}| of distinct substrings at segment [a, b] is minimized over \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$[a,b] \in P$$\end{document}[a,b]∈P. The distinct substrings in the segments represent founder blocks that can be concatenated to form \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\max \{ d(a,b) :[a,b] \in P \}$$\end{document}max{d(a,b):[a,b]∈P} founder sequences representing the original \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {R}}$$\end{document}R such that crossovers happen only at segment boundaries. Results We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(mn^2)$$\end{document}O(mn2). Conclusions Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences.
Databáze: OpenAIRE