A Novel Approach for Genome Data Classification Using Hadoop and Spark Framework

Authors: Nagamma Patil, Shailesh S. Tayde
Publication year: 2016
Subject:
Source: Emerging Research in Computing, Information, Communication and Applications ISBN: 9789811002861
DOI: 10.1007/978-981-10-0287-8_31
Description: Biological data classification is an active area of research. It is difficult to differentiate among different genomes of a given species. This paper deals with classifying the genomes of species such as cat and rat on the Hadoop and Spark frameworks. The sequence encoding scheme uses the n-gram method to convert a sequence into integer values, and features are extracted from these n-grams using two pattern-matching techniques: K-distance approximate (KDA) pattern matching and multiple reference character algorithm (MRCA) pattern matching. A support vector machine (SVM) classifier was trained on the features extracted from the cat and rat genome datasets. Feature extraction and classification were sped up by implementing them on the Hadoop and Spark frameworks.
Database: OpenAIRE
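
Note: The n-gram encoding step mentioned in the description can be illustrated with a minimal sketch. This is not the paper's implementation: the base-4 mapping in BASE_CODE, the default n=3, and the function names encode_ngrams and ngram_counts are assumptions made for illustration, and the KDA and MRCA pattern-matching steps are not reproduced here.

```python
from collections import Counter

# Assumed nucleotide-to-digit mapping; the record does not specify one.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_ngrams(sequence, n=3):
    """Encode each overlapping n-gram of a DNA sequence as a base-4 integer.

    N-grams containing characters outside A/C/G/T (for example N) are skipped.
    """
    sequence = sequence.upper()
    codes = []
    for i in range(len(sequence) - n + 1):
        gram = sequence[i:i + n]
        if all(ch in BASE_CODE for ch in gram):
            value = 0
            for ch in gram:
                # Treat the n-gram as an n-digit number in base 4.
                value = value * 4 + BASE_CODE[ch]
            codes.append(value)
    return codes


def ngram_counts(sequence, n=3):
    """Count how often each encoded n-gram occurs (a simple feature vector)."""
    return Counter(encode_ngrams(sequence, n))
```

Counts of this kind could serve as input features for a classifier such as an SVM; how the paper actually derives its features from the encoded n-grams is not stated in the record.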