Showing 1 - 10 of 3,606 for search: '"Audio-visual speech recognition"'
Author:
Cappellazzo, Umberto, Kim, Minsu, Chen, Honglie, Ma, Pingchuan, Petridis, Stavros, Falavigna, Daniele, Brutti, Alessio, Pantic, Maja
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with (automatic) speech recogn…
External link: http://arxiv.org/abs/2409.12319
Speech recognition is the technology that enables machines to interpret and process human speech, converting spoken language into text or commands. This technology is essential for applications such as virtual assistants, transcription services, and…
External link: http://arxiv.org/abs/2409.00481
Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the des…
External link: http://arxiv.org/abs/2407.06606
Audio-visual speech recognition (AVSR) aims to transcribe human speech using both audio and video modalities. In practical environments with noise-corrupted audio, the role of video information becomes crucial. However, prior works have primarily foc…
External link: http://arxiv.org/abs/2407.03563
Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-…
External link: http://arxiv.org/abs/2408.16564
In this work, we present SynesLM, a unified model which can perform three multimodal language understanding tasks: audio-visual automatic speech recognition (AV-ASR) and visual-aided speech/machine translation (VST/VMT). Unlike previous research that…
External link: http://arxiv.org/abs/2408.00624
Author:
Narayan, Sanath, Djilali, Yasser Abdelaziz Dahou, Singh, Ankit, Bihan, Eustache Le, Hacid, Hakim
This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except for English,…
External link: http://arxiv.org/abs/2406.00038
Author:
Rouditchenko, Andrew, Gong, Yuan, Thomas, Samuel, Karlinsky, Leonid, Kuehne, Hilde, Feris, Rogerio, Glass, James
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models s…
External link: http://arxiv.org/abs/2406.10082
Academic article
This result is not available to unauthenticated users; sign-in is required to view it.
Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy conditions. In t…
External link: http://arxiv.org/abs/2405.12983