Translating speech with just images

Author:	Oneata, Dan; Kamper, Herman
Year of publication:	2024
Subject:	Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Computation and Language; Computer Science - Sound
Document type:	Working Paper
Description:	Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form. Comment: Accepted at Interspeech 2024
Database:	arXiv
External link:	http://arxiv.org/abs/2406.07133