Fast and scalable protein motif sequence clustering based on Hadoop framework

Authors: Mohammad Amin Nikbakht, Mahsa Asadi, Nasser Ghadiri, Sylvain Pitre, Erfan Farhangi
Year of publication: 2017
Subject:
Source: 2017 3th International Conference on Web Research (ICWR).
DOI: 10.1109/icwr.2017.7959300
Description: In recent years, we have been faced with large amounts of sporadic, unstructured data on the web. With the explosive growth of such data, there is a growing need for effective methods, such as clustering, to analyze it and extract information. Biological data forms an important part of the unstructured data on the web, and protein sequence databases are considered a primary source of biological data. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed of data processing and analysis. Proteins are responsible for most of the activities in cells, and the majority of proteins perform their function through interaction with other proteins. Hence, prediction of protein interactions is an important research area in the biomedical sciences. Motifs are fragments that occur frequently in protein sequences. A well-known method for determining protein interactions is based on motif clustering. Existing motif clustering methods share a limitation on the number of clusters. Given the vast number of motifs and the need for a large number of clusters, an efficient, scalable and fast method is needed to cluster such a large number of sequences. In this paper, we propose a novel approach to clustering a large number of motifs. Our approach includes extracting motifs from protein sequences, feature selection, preprocessing, dimension reduction, and running BigFCM (a large-scale fuzzy clustering method) on several distributed nodes with the Hadoop framework to take advantage of MapReduce programming. Experimental results show that our approach performs very well.
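The fuzzy clustering step named in the abstract can be illustrated with a minimal, single-machine sketch of plain fuzzy c-means. This is not BigFCM and not the authors' implementation: the function names, the fuzzifier value `m=2.0`, the iteration count, and the toy two-dimensional points are all assumptions made for illustration. In a MapReduce setting, one would expect the per-point membership computation to fit a map step and the weighted sums for the centers to fit a reduce step, but the record does not describe how BigFCM distributes the work.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Return u[j][i], the membership of point j in cluster i (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers:
            # split full membership evenly among those centers.
            zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(zeros)
            memberships.append([z / total for z in zeros])
            continue
        row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
               for d_i in dists]
        memberships.append(row)
    return memberships

def fcm_centers(points, memberships, n_clusters, m=2.0):
    """Recompute each center as the mean of all points weighted by u**m."""
    dim = len(points[0])
    centers = []
    for i in range(n_clusters):
        weights = [u[i] ** m for u in memberships]
        total = sum(weights)
        centers.append([
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ])
    return centers

def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        u = fcm_memberships(points, centers, m)
        centers = fcm_centers(points, u, len(centers), m)
    return centers, fcm_memberships(points, centers, m)

# Toy feature vectors standing in for reduced motif features (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers, u = fuzzy_c_means(points, [(0.0, 0.1), (5.0, 4.9)])
```

Unlike hard clustering, each point receives a degree of membership in every cluster, so a motif that resembles two groups is not forced into only one of them.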
Database: OpenAIRE