RAFTS 3 G: an efficient and versatile clustering software to analyses in large protein datasets.
Authors: | de Lima Nichio BT; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil.; Department of Biochemistry, Biological Sciences Sector - Federal University of Paraná (UFPR), Curitiba, PR, Brazil., de Oliveira AMR; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil., de Pierri CR; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil.; Department of Biochemistry, Biological Sciences Sector - Federal University of Paraná (UFPR), Curitiba, PR, Brazil., Santos LGC; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil., Lejambre AQ; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil., Vialle RA; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil., da Rocha Coimbra NA; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil., Guizelini D; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil., Marchaukoski JN; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil., de Oliveira Pedrosa F; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil.; Department of Biochemistry, Biological Sciences Sector - Federal University of Paraná (UFPR), Curitiba, PR, Brazil., Raittz RT; Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR, Brazil. raittz@ufpr.br. |
---|---|
Language: | English |
Source: | BMC bioinformatics [BMC Bioinformatics] 2019 Jul 15; Vol. 20 (1), pp. 392. Date of Electronic Publication: 2019 Jul 15. |
DOI: | 10.1186/s12859-019-2973-4 |
Abstract: | Background: Clustering methods are essential for partitioning biological samples and are useful for reducing information complexity in large datasets. Tools in this context usually generate data with greedy algorithms, which solve some data mining difficulties but can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent bases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods, among which alignment-free methods stand out, and a good choice of dataset is essential to them. Results: Here we present a new approach to data mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS 3 G), a clustering method that aims to lose less biological information while generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity when generating clusters across a wide range of similarity. RAFTS 3 G is a better choice than three main methods when the user wants more reliable results even without knowing the ideal clustering threshold. Conclusion: In general, RAFTS 3 G is able to group up to millions of biological sequences in large datasets, making it a remarkably efficient option for clustering. Compared to other "gold-standard" methods for clustering large biological data, RAFTS 3 G maintains the balance between reducing redundancy in biological information and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relation and reduces the time needed to generate data with the RAFTS 3 G process. |
Database: | MEDLINE |