A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

Author:	Benson Andrew K, Way Samuel F, Russell David J, Sayood Khalid
Language:	English
Publication year:	2010
Subject:	Computer applications to medicine. Medical informatics, R858-859.7; Biology (General), QH301-705.5
Source:	BMC Bioinformatics, Vol 11, Iss 1, p 601 (2010)
Document type:	article
ISSN:	1471-2105
DOI:	10.1186/1471-2105-11-601
Description:	Abstract. Background: We propose a sequence clustering algorithm and compare its partition quality and execution time with those of a popular existing algorithm. The proposed algorithm uses a grammar-based distance metric to partition a set of biological sequences. Each new sequence is compared with cluster-representative sequences to determine membership; if the comparison fails to identify a suitable cluster, a new cluster is created. Results: The proposed algorithm is validated by comparison with the popular DNA/RNA sequence clustering approach CD-HIT-EST and with the recently developed algorithm UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. Its CPU execution time is comparable to that of CD-HIT-EST, which is much slower than UCLUST, and it generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/4efe6a03f91a46a18813ba4c00728bab
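
The clustering procedure summarized in the abstract (compare each new sequence with the cluster representatives, and open a new cluster when none is suitable) can be illustrated with a minimal Python sketch. The record does not define the grammar-based distance, so `stand_in_distance` below is an assumption: a simple LZ78-style phrase-count measure used only as a placeholder, not the authors' metric. The function names, the `threshold` parameter, and the choice of the first member as representative are likewise hypothetical.

```python
def lz78_phrase_count(seq):
    """Count the phrases in a simple LZ78-style parse of seq."""
    phrases = set()
    current = ""
    count = 0
    for ch in seq:
        current += ch
        if current not in phrases:
            # A new phrase ends here; remember it and start the next one.
            phrases.add(current)
            count += 1
            current = ""
    if current:
        # A trailing, already-seen fragment still counts as one phrase.
        count += 1
    return count


def stand_in_distance(a, b):
    """ASSUMED placeholder distance (not the paper's metric).

    Measures how many extra phrases one sequence adds when parsed after
    the other, normalized by the larger individual phrase count.
    """
    ca, cb = lz78_phrase_count(a), lz78_phrase_count(b)
    denom = max(ca, cb)
    if denom == 0:
        return 0.0
    cab = lz78_phrase_count(a + b)
    cba = lz78_phrase_count(b + a)
    return max(cab - ca, cba - cb) / denom


def greedy_cluster(seqs, threshold, distance=stand_in_distance):
    """Greedy representative-based clustering.

    Each sequence is compared with the representative of every existing
    cluster. It joins the closest cluster whose distance is within
    `threshold`; otherwise it becomes the representative of a new cluster.
    Returns a list of clusters, each a list of indices into `seqs`.
    """
    representatives = []  # index of the representative for each cluster
    clusters = []
    for i, s in enumerate(seqs):
        best, best_d = None, None
        for c, r in enumerate(representatives):
            d = distance(seqs[r], s)
            if d <= threshold and (best_d is None or d < best_d):
                best, best_d = c, d
        if best is None:
            representatives.append(i)
            clusters.append([i])
        else:
            clusters[best].append(i)
    return clusters
```

Calling `greedy_cluster(sequences, threshold)` on a list of strings returns index groups; passing a different `distance` callable swaps in another metric without changing the clustering loop. The result depends on input order, which is a general property of this kind of greedy scheme.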