Cross-modal Embeddings for Video and Audio Retrieval
Author: Jordi Torres, Dídac Surís, Amaia Salvador, Amanda Duarte, Xavier Giro-i-Nieto
Contributors: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions; Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo
Year of publication: 2019
Subject: FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering; Computer Science - Computer Vision and Pattern Recognition (cs.CV); Computer Science - Sound (cs.SD); Computer Science - Information Retrieval (cs.IR); Electrical Engineering and Systems Science - Audio and Speech Processing (eess.AS); Cross-modal; Feature learning; Feature vector; Image processing; Machine learning; Neural networks (Computer science); Recall; Retrieval; Speech recognition; YouTube-8M; Telecommunication engineering::Signal processing::Image and video signal processing [UPC thematic areas]
Source: Lecture Notes in Computer Science (ISBN 9783030110178); ECCV Workshops (4); UPCommons. Portal del coneixement obert de la UPC, Universitat Politècnica de Catalunya (UPC); Recercat. Dipòsit de la Recerca de Catalunya
DOI: 10.1007/978-3-030-11018-5_62
Description: The increasing amount of online video brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we are able to create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit well to a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K, obtained over a subset of YouTube-8M videos, show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings at both scales and assess their quality in a retrieval task: the features extracted from one modality are used to retrieve the most similar videos according to the features computed from the other modality (a minimal sketch of this setup follows the record below). 6 pages, 3 figures.
Database: OpenAIRE
External link:
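The description sketches the complete pipeline: precomputed visual and audio features for each YouTube-8M video are projected by a two-branch neural network into a common embedding space, and cross-modal retrieval is then scored with Recall@K. Below is a minimal, self-contained sketch of that setup, not the authors' released code: the input sizes (1024-d visual, 128-d audio, the dimensions of the features YouTube-8M distributes), the 256-d joint space, and the margin-based contrastive loss are illustrative assumptions, since the record only states that both modalities are mapped into a common region of the feature space.

```python
# Minimal sketch (not the paper's released code) of a two-branch network that
# projects precomputed video and audio features into a shared embedding space,
# retrieves across modalities by cosine similarity, and scores Recall@K.
# Feature sizes, the 256-d joint space, and the margin-based contrastive loss
# are illustrative assumptions beyond what the abstract states.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One modality branch: a small MLP mapping input features to the joint space."""
    def __init__(self, in_dim: int, joint_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, joint_dim),
        )

    def forward(self, x):
        # L2-normalize so dot products equal cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(v, a, margin: float = 0.2):
    """Pull matching video/audio pairs together, push in-batch negatives apart."""
    sim = v @ a.t()                      # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)        # similarity of each true pair
    # Hinge on every negative in the batch, for both retrieval directions.
    cost_a = (margin + sim - pos).clamp(min=0)       # video -> audio
    cost_v = (margin + sim - pos.t()).clamp(min=0)   # audio -> video
    off_diag = 1 - torch.eye(sim.size(0))            # mask out the positives
    return ((cost_a + cost_v) * off_diag).sum() / sim.size(0)

@torch.no_grad()
def recall_at_k(query_emb, gallery_emb, k: int = 10) -> float:
    """Fraction of queries whose true match appears among the top-k neighbours."""
    sim = query_emb @ gallery_emb.t()
    topk = sim.topk(k, dim=1).indices
    target = torch.arange(sim.size(0)).unsqueeze(1)
    return (topk == target).any(dim=1).float().mean().item()

if __name__ == "__main__":
    video_branch, audio_branch = Branch(1024), Branch(128)
    params = list(video_branch.parameters()) + list(audio_branch.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    # Random stand-ins for the precomputed YouTube-8M visual/audio features.
    vid_feats, aud_feats = torch.randn(64, 1024), torch.randn(64, 128)
    for _ in range(10):
        loss = contrastive_loss(video_branch(vid_feats), audio_branch(aud_feats))
        opt.zero_grad(); loss.backward(); opt.step()
    print("audio->video Recall@10:",
          recall_at_k(audio_branch(aud_feats), video_branch(vid_feats)))
```

Because both branches are L2-normalized, every dot product is a cosine similarity, so retrieval in the joint space reduces to one matrix multiplication followed by a top-k lookup, and it works in either direction: audio querying video or vice versa.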