End2End Acoustic to Semantic Transduction
Author: | Renato De Mori, Antoine Laurent, Sylvain Meignier, Antoine Caubrière, Valentin Pelloin, Nathalie Camelin, Yannick Estève |
Contributors: | Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, McGill University = Université McGill [Montréal, Canada] |
Year of publication: | 2021 |
Subject: | FOS: Computer and information sciences; Sound (cs.SD); Computer science; Speech recognition; Feature extraction; Word error rate; Context (language use); 02 engineering and technology; Transduction (psychology); Semantics; [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]; Computer Science - Sound; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; Reduction (complexity); 030507 speech-language pathology & audiology; 03 medical and health sciences; [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing; [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]; Audio and Speech Processing (eess.AS); 0202 electrical engineering electronic engineering information engineering; FOS: Electrical engineering electronic engineering information engineering; Computer Science - Computation and Language; [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD]; 020201 artificial intelligence & image processing; Language model; 0305 other medical science; Computation and Language (cs.CL); Spoken language; Electrical Engineering and Systems Science - Audio and Speech Processing |
Source: | ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2021, Toronto, ON, Canada. ⟨10.1109/ICASSP39728.2021.9413581⟩ |
DOI: | 10.48550/arxiv.2102.01013 |
Description: | In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow-fusion language model (see the sketch after this record), this system reaches a 13.6 concept error rate (CER) and an 18.5 concept value error rate (CVER) on the French MEDIA corpus, a 2.8-point absolute reduction compared to the state of the art. An original model is then proposed for hypothesizing concepts and their values. This transduction reaches a 15.4 CER and a 21.6 CVER without any new type of context. Comment: Accepted at IEEE ICASSP 2021 |
Database: | OpenAIRE |
External link: |
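The first system in the description combines the end-to-end model with an external language model through shallow fusion, i.e. the language model's token scores are interpolated with the end-to-end model's scores at each decoding step. The following is a minimal, illustrative Python sketch of greedy decoding with shallow fusion; the vocabulary, the toy scoring functions, and the fusion weight are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of shallow-fusion decoding: at each step the end-to-end model's
# token log-probabilities are combined with an external LM's log-probabilities
# before choosing the next token. All functions and constants below are
# illustrative placeholders, not the paper's actual models or values.

import numpy as np

VOCAB = ["<eos>", "bonjour", "hotel", "chambre", "<concept>", "</concept>"]
LM_WEIGHT = 0.3  # shallow-fusion interpolation weight (hypothetical value)


def e2e_log_probs(acoustic_feats, prefix):
    """Stand-in for the end-to-end attention model: log P(token | audio, prefix)."""
    rng = np.random.default_rng(len(prefix))        # deterministic toy scores
    logits = rng.normal(size=len(VOCAB))
    return logits - np.log(np.exp(logits).sum())    # log-softmax


def lm_log_probs(prefix):
    """Stand-in for the external language model: log P(token | prefix)."""
    rng = np.random.default_rng(1000 + len(prefix))
    logits = rng.normal(size=len(VOCAB))
    return logits - np.log(np.exp(logits).sum())


def greedy_shallow_fusion(acoustic_feats, max_len=20):
    """Greedy decoding: pick the argmax of the fused scores at every step."""
    prefix = []
    for _ in range(max_len):
        fused = e2e_log_probs(acoustic_feats, prefix) + LM_WEIGHT * lm_log_probs(prefix)
        token = VOCAB[int(np.argmax(fused))]
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix


if __name__ == "__main__":
    dummy_audio = np.zeros((100, 40))  # e.g. 100 frames of 40-dim filterbank features
    print(greedy_shallow_fusion(dummy_audio))
```

In practice the fusion weight is tuned on a development set, and beam search is typically used in place of the greedy loop shown here.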