Showing 1 - 10 of 402 for search: '"Siohan, P"'
Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have…
External link:
http://arxiv.org/abs/2312.09369
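The snippet is cut off before it names the SSL method, but the general idea behind audio-visual self-supervised pretraining can be illustrated with a masked-prediction objective: hide random time steps of the features and train an encoder to reconstruct them. A minimal PyTorch sketch (dimensions and encoder choice are hypothetical, not from the paper):

```python
import torch
import torch.nn as nn

def masked_prediction_loss(encoder, av_feats, mask_prob=0.3):
    """Generic masked-prediction SSL objective: hide random time steps of
    the fused audio-visual features and train the encoder to predict them."""
    B, T, D = av_feats.shape
    mask = torch.rand(B, T) < mask_prob                     # True = masked
    corrupted = av_feats.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = encoder(corrupted)                               # (B, T, D)
    return nn.functional.mse_loss(pred[mask], av_feats[mask])

# Toy usage with a small Transformer encoder (illustrative sizes).
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
loss = masked_prediction_loss(nn.TransformerEncoder(layer, num_layers=2),
                              torch.randn(2, 100, 64))
```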
It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing…
External link:
http://arxiv.org/abs/2312.10088
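The snippet does not say how the paper copes with missing video, but a common generic remedy is modality dropout: randomly zero out the visual stream during training so the fused model also learns to operate audio-only. A hedged sketch (all module and parameter names hypothetical):

```python
import torch
import torch.nn as nn

class AVFusionWithModalityDropout(nn.Module):
    """Fuse audio and visual features; randomly zero the visual stream
    during training so the model also learns to work audio-only."""

    def __init__(self, audio_dim=512, visual_dim=256, out_dim=512,
                 p_drop_visual=0.5):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, out_dim)
        self.p_drop_visual = p_drop_visual

    def forward(self, audio, visual):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim)
        if self.training:
            # Per-utterance Bernoulli mask: 1 keeps video, 0 drops it entirely.
            keep = (torch.rand(audio.size(0), 1, 1, device=audio.device)
                    > self.p_drop_visual).float()
            visual = visual * keep
        return self.proj(torch.cat([audio, visual], dim=-1))
```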
In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need…
External link:
http://arxiv.org/abs/2312.10087
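One standard way to train without explicit speech-text alignment is an alignment-free loss such as CTC, which marginalizes over all monotonic alignments; whether this paper uses it is not stated in the snippet. A toy PyTorch example (shapes and vocabulary size are illustrative):

```python
import torch
import torch.nn as nn

# CTC marginalizes over all monotonic speech-to-text alignments, so no
# frame-level alignment supervision is needed at training time.
T, B, C = 50, 4, 30                          # frames, batch, vocab (0 = blank)
log_probs = torch.randn(T, B, C).log_softmax(-1)
targets = torch.randint(1, C, (B, 12))       # label sequences (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```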
Multi-talker speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch using simulated or actual overlapping…
External link:
http://arxiv.org/abs/2306.16398
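The snippet mentions training on "simulated or actual overlapping" speech; simulating a two-talker mixture is straightforward. A sketch (assuming 1-D float waveforms; the offset and SNR values are arbitrary, not the paper's):

```python
import numpy as np

def mix_overlapping(primary, interferer, snr_db=5.0, offset=8000):
    """Simulate a two-talker mixture: delay the interfering utterance,
    scale it to the target SNR, and add it to the primary utterance."""
    out = primary.copy()
    seg = interferer[: max(0, len(primary) - offset)]
    if len(seg) == 0:
        return out
    p_pow = np.mean(primary ** 2) + 1e-10
    i_pow = np.mean(seg ** 2) + 1e-10
    scale = np.sqrt(p_pow / (i_pow * 10 ** (snr_db / 10)))
    out[offset: offset + len(seg)] += scale * seg
    return out
```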
Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level,…
External link:
http://arxiv.org/abs/2302.10915
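A minimal sketch of the two-level hierarchy the snippet describes: a 3D-convolutional front-end with a small temporal receptive field over raw pixels, whose per-frame outputs would feed a higher-level temporal model. The layer sizes are hypothetical, not the paper's:

```python
import torch
import torch.nn as nn

class VisualFrontEnd(nn.Module):
    """Low-level visual front-end: a 3D convolution with a small temporal
    receptive field over raw lip/face pixels, followed by spatial pooling.
    Higher-level temporal modeling (e.g. a Transformer) runs on its output."""

    def __init__(self, out_dim=256):
        super().__init__()
        # kernel_size=(5, 7, 7): sees only 5 video frames at a time.
        self.conv = nn.Conv3d(3, out_dim, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep time, pool space

    def forward(self, video):
        # video: (B, 3, T, H, W) -> (B, T, out_dim)
        x = self.pool(torch.relu(self.conv(video)))
        return x.squeeze(-1).squeeze(-1).transpose(1, 2)
```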
Author:
Braga, Otavio, Siohan, Olivier
Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation, assuming the video of a single speaking face matches the audio, and selecting…
External link:
http://arxiv.org/abs/2205.05684
Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen,…
External link:
http://arxiv.org/abs/2205.05586
Author:
Braga, Otavio, Siohan, Olivier
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible, this traditionally requires solving a…
External link:
http://arxiv.org/abs/2205.05206
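The snippet stops mid-sentence, but a common alternative to a hard active-speaker decision is to attend softly over all candidate face tracks, using the audio as the query. A hypothetical sketch of that idea (not necessarily the paper's architecture):

```python
import torch
import torch.nn as nn

class SoftFaceSelector(nn.Module):
    """Instead of a hard active-speaker decision, attend over the visual
    features of all candidate faces, using the audio as the query."""

    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, audio, faces):
        # audio: (B, T, dim); faces: (B, N, T, dim) for N candidate faces
        q = self.q(audio).unsqueeze(1)                 # (B, 1, T, dim)
        k = self.k(faces)                              # (B, N, T, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5   # (B, N, T)
        w = scores.softmax(dim=1).unsqueeze(-1)        # weights over faces
        return (w * faces).sum(dim=1)                  # (B, T, dim)
```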
Author:
Rose, Richard, Siohan, Olivier
This paper presents a new approach for end-to-end audio-visual multi-talker speech recognition. The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded…
External link:
http://arxiv.org/abs/2204.00652
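Reduced to a toy, assigning decoded output using video information could look like picking, per hypothesis, the candidate face that accumulated the most attention mass (a hypothetical simplification, not the VCAM itself):

```python
import torch

def assign_transcript_to_face(face_attention):
    """face_attention: (N, U) attention weights over N candidate faces
    for U decoded tokens; assign the whole hypothesis to the face that
    received the most attention mass overall."""
    return int(face_attention.mean(dim=1).argmax())
```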
Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the…
External link:
http://arxiv.org/abs/2201.10439
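As a crude illustration of using mouth motion to augment audio features, one could compute frame-to-frame pixel differences over a mouth crop and append them to the acoustic features (entirely hypothetical feature engineering, not the paper's method):

```python
import numpy as np

def mouth_motion_features(mouth_frames):
    """Crude motion representation: mean absolute frame-to-frame pixel
    difference over the mouth region, one scalar per video frame.
    mouth_frames: (T, H, W) array of mouth-crop grayscale frames."""
    diffs = np.abs(np.diff(mouth_frames.astype(np.float32), axis=0))
    motion = diffs.mean(axis=(1, 2))
    return np.concatenate([[0.0], motion])   # pad so length matches input

def fuse(audio_feats, motion):
    """Append the motion scalar to each audio frame (lengths assumed equal)."""
    return np.concatenate([audio_feats, motion[:, None]], axis=1)
```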