Audio Captioning with Composition of Acoustic and Semantic Information

Authors: Mustafa Sert, Aysegul Ozkaya Eren
Year of publication: 2021
Source: International Journal of Semantic Computing. 15:143-160
ISSN: 1793-7108
1793-351X
DOI: 10.1142/s1793351x21400018
Description: Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder-decoder-based models without considering semantic information. To fill this gap, we present a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder-decoder model. To enable semantic embeddings for the test audios, we introduce a Multilayer Perceptron classifier to predict the semantic embeddings of those clips. We also present exhaustive experiments to show the efficiency of different features and datasets for our proposed model on the audio captioning task. To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings. Extensive experiments on two audio captioning datasets, Clotho and AudioCaps, show that our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics, and that using semantic information improves the captioning performance. Keywords: Audio captioning; PANNs; VGGish; GRU; BiGRU.
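The semantic-embedding idea above can be sketched as follows. The abstract states that subjects and verbs are taken from the training captions and turned into an embedding that accompanies the audio features; a common way to realize this is a multi-hot vector over a keyword vocabulary. The tag list and lookup below are illustrative placeholders, not the authors' implementation (the paper uses an NLP pipeline to extract subjects and verbs, which is stood in for here by pre-lemmatized words).

```python
# Minimal sketch: multi-hot semantic embeddings from caption keywords.
# SEMANTIC_TAGS is a hypothetical vocabulary of subjects/verbs; in the
# paper this vocabulary would come from parsing the training captions.
SEMANTIC_TAGS = ["dog", "rain", "car", "bark", "fall", "pass"]

def semantic_embedding(caption_words, tags=SEMANTIC_TAGS):
    """Return a multi-hot vector marking which semantic tags occur
    in the (already lemmatized) caption words."""
    present = set(caption_words)
    return [1.0 if tag in present else 0.0 for tag in tags]

# Example: a caption mentioning a dog barking in the rain activates
# the "dog", "rain", and "bark" dimensions.
emb = semantic_embedding(["a", "dog", "bark", "in", "the", "rain"])
```

At test time no caption is available, which is why the abstract introduces an MLP classifier that predicts such a vector directly from the audio embedding; the vector above would then serve as the classifier's training target.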
Comment: Accepted for publication in International Journal of Semantic Computing. arXiv admin note: substantial text overlap with arXiv:2006.03391
Database: OpenAIRE