Cross-modal Embeddings for Video and Audio Retrieval

Authors: Jordi Torres, Dídac Surís, Amaia Salvador, Amanda Duarte, Xavier Giro-i-Nieto
Contributors: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions, Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo
Year: 2019
Subject:
FOS: Computer and information sciences
Sound (cs.SD)
Computer science
Computer Vision and Pattern Recognition (cs.CV)
Speech recognition
Feature vector
Information Systems: Information Storage and Retrieval
Computer Science - Computer Vision and Pattern Recognition
02 engineering and technology
Computer Science - Sound
Computer Science - Information Retrieval
Neural networks (Computer science)
Image processing
Audio and Speech Processing (eess.AS)
Cross-modal
Machine learning
FOS: Electrical engineering, electronic engineering, information engineering
YouTube-8M
0202 electrical engineering, electronic engineering, information engineering
Retrieval
Recall
020206 networking & telecommunications
Telecommunication engineering::Signal processing::Image and video signal processing [UPC subject areas]
Modal
020201 artificial intelligence & image processing
Joint (audio engineering)
Feature learning
Information Retrieval (cs.IR)
Electrical Engineering and Systems Science - Audio and Speech Processing
Source: Lecture Notes in Computer Science ISBN: 9783030110178
ECCV Workshops (4)
UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
Recercat. Dipósit de la Recerca de Catalunya
instname
DOI: 10.1007/978-3-030-11018-5_62
Description: The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit well to a given silent video, and also to retrieve images that match a given query audio. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings for both scales and assess their quality in a retrieval problem, formulated as using the features extracted from one modality to retrieve the most similar videos based on the features computed from the other modality.
6 pages, 3 figures
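The Recall@K retrieval evaluation described above can be sketched as follows. This is a minimal illustration of scoring paired cross-modal embeddings by cosine similarity, not the authors' implementation; the array names, dimensions, and noise level are illustrative assumptions:

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose true match (same row index in the
    gallery) appears among the top-k nearest gallery items by cosine
    similarity."""
    # L2-normalise so a dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                          # (n_queries, n_gallery)
    # indices of the k most similar gallery items per query, best first
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return hits.mean()

# Toy usage: treat audio embeddings as queries and the paired visual
# embeddings as the gallery (synthetic correlated data for illustration)
rng = np.random.default_rng(0)
audio = rng.normal(size=(100, 128))
video = audio + 0.1 * rng.normal(size=(100, 128))
print(recall_at_k(audio, video, k=1))
print(recall_at_k(audio, video, k=10))
```

The same function covers both retrieval directions reported in the paper (audio-to-visual and visual-to-audio) by swapping which modality plays the query role.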
Database: OpenAIRE