Multi Stage Common Vector Space for Multimodal Embeddings
Author: | Shagan Sah, Premkumar Udaiyar, Sabarish Gopalakrishnan, Raymond Ptucha |
Year of publication: | 2019 |
Subject: | Audio signal, Modality (human–computer interaction), Computer science, Deep learning, Pattern recognition, Translation (geometry), Convolutional neural network, Recurrent neural network, Encoding (memory), Artificial intelligence, 02 Engineering and Technology, 0202 Electrical Engineering, Electronic Engineering, Information Engineering, 020201 Artificial Intelligence & Image Processing |
Source: | AIPR |
DOI: | 10.1109/aipr47015.2019.9174583 |
Description: | Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks, whereas text is typically encoded using recurrent neural networks. Similarly, other modalities of data like 3D point clouds, audio signals, and videos can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can perform tasks like multimodal retrieval, search, and generation, where, for example, images can be retrieved from text or audio input. The addition of a new modality would generally mean resetting and retraining the entire network. However, we introduce a stagewise learning technique where each modality is compared to a reference modality before being projected to the M-CVS. Our method ensures that a new modality can be mapped into the M-CVS without changing existing encodings, allowing the extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMediaNet multimodal datasets. Extensive ablation experiments using images, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases. (A minimal illustrative sketch of the contrastive alignment idea follows this record.) |
Database: | OpenAIRE |
External link: |
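
The description mentions per-modality encoders, projection into a shared low-dimensional space, contrastive losses that preserve cross-modal similarity, and a stagewise scheme in which a new modality is aligned against a reference modality without changing existing encodings. The sketch below illustrates those ideas in PyTorch; it is not the authors' implementation, and the projection sizes, margin, loss formulation, and the image-as-reference choice are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of aligning modality-specific
# features in a common vector space with a contrastive loss, then adding
# a new modality against a frozen reference modality.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Maps modality-specific features into the common vector space."""

    def __init__(self, in_dim: int, cvs_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, cvs_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalise so cosine similarity reduces to a dot product.
        return F.normalize(self.net(x), dim=-1)


def contrastive_loss(anchor: torch.Tensor,
                     other: torch.Tensor,
                     margin: float = 0.2) -> torch.Tensor:
    """Max-margin contrastive loss: rows with the same index are positive
    pairs; every other row in the batch acts as a negative."""
    sim = anchor @ other.t()                      # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                 # similarity of true pairs
    cost = (margin + sim - pos).clamp(min=0)      # hinge on negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost.masked_fill(mask, 0).mean()


# --- Stage 1: align a reference modality (here, images) with text. --------
img_head = ProjectionHead(in_dim=2048)   # e.g. CNN image features
txt_head = ProjectionHead(in_dim=1024)   # e.g. RNN text features

img_feats = torch.randn(32, 2048)        # stand-ins for real encoder outputs
txt_feats = torch.randn(32, 1024)

opt = torch.optim.Adam(
    list(img_head.parameters()) + list(txt_head.parameters()), lr=1e-4)
loss = contrastive_loss(img_head(img_feats), txt_head(txt_feats))
loss.backward()
opt.step()

# --- Stage 2: add a new modality (here, audio) against the frozen reference.
# Existing projections stay fixed, so previously learned encodings are
# not disturbed when the space is extended.
for p in img_head.parameters():
    p.requires_grad_(False)

aud_head = ProjectionHead(in_dim=128)    # e.g. audio features
aud_feats = torch.randn(32, 128)

opt_new = torch.optim.Adam(aud_head.parameters(), lr=1e-4)
loss_new = contrastive_loss(img_head(img_feats), aud_head(aud_feats))
loss_new.backward()
opt_new.step()
```

Freezing the reference projection in the second stage is what allows an additional modality to be mapped into the common space without retraining the modalities that are already encoded, which is the extension property the description emphasizes.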