Metric learning on biological sequence embeddings
Author: | James M. Hogan, Pravesh Biyani, Dhananjay Kimothi, Saket Anand, Ankita Shukla |
---|---|
Year of publication: | 2017 |
Subject: |
0301 basic medicine; Mahalanobis distance; Data stream mining; Computer science; Feature vector; Supervised learning; Machine learning; Euclidean distance; 03 medical and health sciences; 030104 developmental biology; Leverage (statistics); Embedding; Word2vec; Artificial intelligence |
Source: | SPAWC |
Description: | Embedding techniques such as word2vec [1] have gained popularity due to their ability to represent words and their semantic variants as real-valued vectors. Biological sequence analysis may also leverage unsupervised feature representations, augmented with supervised learning techniques for tasks like retrieval and classification. Algorithms that rely on distance metrics are computationally efficient and can handle large datasets; however, default distances in the embedded space often yield inadequate accuracy. In this paper, we use class labels to learn a Mahalanobis distance in the embedded feature vector space and show performance improvements over the default Euclidean metric in both retrieval and classification tasks. The approach may be readily generalised and is applicable to a wide range of problems in sequence analysis, as well as to others involving discrete entities or segmented data streams. |
Database: | OpenAIRE |
External link: |
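
The description above contrasts the default Euclidean metric with a Mahalanobis distance learned from class labels. The following is a minimal illustrative sketch of that contrast, not the authors' method: the record does not say which metric learning algorithm the paper uses, so the metric here is simply the inverse of the pooled within-class covariance, a classical label-driven choice. The function names, the `eps` regulariser and the toy 2-D "embeddings" are all hypothetical.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ M @ d))

def within_class_metric(X, labels, eps=1e-6):
    """Assumed stand-in for a learned metric: inverse pooled within-class covariance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dim = X.shape[1]
    S = np.zeros((dim, dim))
    for c in np.unique(labels):
        Xc = X[labels == c]
        centered = Xc - Xc.mean(axis=0)  # remove the class mean
        S += centered.T @ centered       # accumulate within-class scatter
    S /= len(X)
    # eps * I keeps the matrix invertible (hypothetical regulariser).
    return np.linalg.inv(S + eps * np.eye(dim))

def nearest_label(query, X, labels, M):
    """Label of the nearest neighbour of `query` under the metric M."""
    dists = [mahalanobis(query, x, M) for x in X]
    return int(labels[int(np.argmin(dists))])

# Toy data (made up): classes differ along the second axis,
# while the first axis carries large within-class variation.
X = np.array([[0.0, 0.1], [6.0, -0.1], [-6.0, 0.0],   # class 0
              [3.0, 1.1], [-3.0, 0.9], [9.0, 1.0]])   # class 1
labels = np.array([0, 0, 0, 1, 1, 1])
query = np.array([3.0, 0.2])  # second coordinate is close to class 0

M = within_class_metric(X, labels)
euclid_label = nearest_label(query, X, labels, np.eye(2))  # identity M = Euclidean
maha_label = nearest_label(query, X, labels, M)
print("Euclidean nearest-neighbour label:", euclid_label)
print("Mahalanobis nearest-neighbour label:", maha_label)
```

On this toy set the Euclidean nearest neighbour is misled by the high-variance first coordinate, while the label-derived metric down-weights it. Setting `M` to the identity matrix recovers the Euclidean distance, so both metrics go through the same retrieval routine.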