Metric learning on biological sequence embeddings

Authors: James M. Hogan, Pravesh Biyani, Dhananjay Kimothi, Saket Anand, Ankita Shukla
Year of publication: 2017
Subject:
Source: SPAWC
Description: Embedding techniques such as word2vec [1] have gained popularity due to their ability to represent words and their semantic variants as real-valued vectors. Biological sequence analysis may also leverage unsupervised feature representations, augmented with supervised learning techniques for tasks like retrieval and classification. Algorithms that rely on distance metrics are computationally efficient and can handle large datasets; however, default distances in the embedded space often yield inadequate accuracy. In this paper, we use class labels to learn a Mahalanobis distance in the embedded feature vector space and show performance improvements over the default Euclidean metric in both retrieval and classification tasks. The approach may be readily generalised and is applicable to a wide range of problems in sequence analysis and others involving discrete entities or segmented data streams.
Database: OpenAIRE
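
Illustrative note: the description contrasts the default Euclidean metric with a Mahalanobis distance learned from class labels, d_M(x, y) = sqrt((x - y)^T M (x - y)) with M positive semidefinite. The record does not say which metric learning algorithm the paper uses, so the sketch below is only a stand-in under stated assumptions: it builds a diagonal M from inverse pooled within-class variances, which is not the authors' method. The function names, the toy 2-D "embeddings", and the labels are all hypothetical.

```python
import math


def mahalanobis(x, y, M):
    """Return sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    # Compute M @ d, then the dot product d . (M @ d).
    Md = [sum(m_ij * d_j for m_ij, d_j in zip(row, d)) for row in M]
    return math.sqrt(max(0.0, sum(d_i * md_i for d_i, md_i in zip(d, Md))))


def diagonal_within_class_metric(X, labels, eps=1e-6):
    """Assumed stand-in for a learned metric (not the paper's algorithm):
    a diagonal M whose entries are inverse pooled within-class variances."""
    dim = len(X[0])
    groups = {}
    for x, c in zip(X, labels):
        groups.setdefault(c, []).append(x)
    # Per-class mean of every dimension.
    means = {c: [sum(col) / len(rows) for col in zip(*rows)]
             for c, rows in groups.items()}
    # Pooled within-class variance per dimension; eps avoids division by zero.
    var = [eps] * dim
    for x, c in zip(X, labels):
        for j in range(dim):
            var[j] += (x[j] - means[c][j]) ** 2 / len(X)
    return [[1.0 / var[j] if i == j else 0.0 for j in range(dim)]
            for i in range(dim)]


def nearest(query, X, M):
    """Index of the stored vector closest to the query under metric M."""
    return min(range(len(X)), key=lambda i: mahalanobis(query, X[i], M))


# Toy data: dimension 0 varies widely inside each class (uninformative),
# dimension 1 separates the two classes.
X = [[0.0, 0.0], [4.0, 0.2], [0.5, 1.0], [4.5, 1.2]]
labels = ["a", "a", "b", "b"]
query = [4.0, 0.8]

identity = [[1.0, 0.0], [0.0, 1.0]]  # M = I reduces to the Euclidean metric
M = diagonal_within_class_metric(X, labels)

print(labels[nearest(query, X, identity)])  # a
print(labels[nearest(query, X, M)])         # b
```

In this toy setting the Euclidean nearest neighbour is driven by the uninformative dimension, while the label-derived metric down-weights it and retrieves a neighbour from the class that matches the query on the informative dimension. A full (non-diagonal) M learned by a dedicated metric learning method would follow the same distance formula.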