Popis: |
Motivation: Massively parallel phylogenetic analyses allow the reconstruction of phylogenetic trees for every gene in a genome, typically using the set of potential homologues detected via BLAST or a similar tool. However, if the number of hits is too high, the dataset should be reduced to a tractable size, preferably without human intervention. Currently available methods are error-prone on at least some datasets, and some also depend on additional data that may not be available. Results: We propose a distance-based algorithm, termed Distant Joining, for phylogenetic dataset reduction that does not require any input besides the sequences themselves. It was shown to be robust to both complex evolutionary histories and large datasets. We also discuss the assumptions and limitations of different sequence sampling approaches, and provide guidelines for selecting a method for a phylomic pipeline. Availability: A proof-of-concept Python implementation is available at https://github.com/SynedraAcus/sampler under the terms of the CC-BY-4.0 license. Please check the README for dependencies. Supplementary information: Supplementary data are available at Journal of Bioinformatics and Genomics online. |