Výsledky vyhledávání

Report

Audio-visual fine-tuning of audio-only ASR models

Autor: May, Avner, Serdyuk, Dmitriy, Shah, Ankit Parag, Braga, Otavio, Siohan, Olivier

Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches hav

Externí odkaz: http://arxiv.org/abs/2312.09369

Zobrazit plný text záznamu

Report

On Robustness to Missing Video for Audiovisual Speech Recognition

Autor: Chang, Oscar, Braga, Otavio, Liao, Hank, Serdyuk, Dmitriy, Siohan, Olivier

It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missin

Externí odkaz: http://arxiv.org/abs/2312.10088

Zobrazit plný text záznamu

Report

Revisiting the Entropy Semiring for Neural Speech Recognition

Autor: Chang, Oscar, Hwang, Dongseong, Siohan, Olivier

In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need

Externí odkaz: http://arxiv.org/abs/2312.10087

Zobrazit plný text záznamu

Report

Cascaded encoders for fine-tuning ASR models on overlapped speech

Autor: Rose, Richard, Chang, Oscar, Siohan, Olivier

Multi-talker speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch using simulated or actual overlapp

Externí odkaz: http://arxiv.org/abs/2306.16398

Zobrazit plný text záznamu

Report

Conformers are All You Need for Visual Speech Recognition

Autor: Chang, Oscar, Liao, Hank, Serdyuk, Dmitriy, Shah, Ankit, Siohan, Olivier

Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level,

Externí odkaz: http://arxiv.org/abs/2302.10915

Zobrazit plný text záznamu

Report

A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

Autor: Braga, Otavio, Siohan, Olivier

Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation assuming the video of a single speaking face matches the audio, and sele

Externí odkaz: http://arxiv.org/abs/2205.05684

Zobrazit plný text záznamu

Report

End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

Autor: Braga, Otavio, Makino, Takaki, Siohan, Olivier, Liao, Hank

Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on sc

Externí odkaz: http://arxiv.org/abs/2205.05586

Zobrazit plný text záznamu

Report

Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

Autor: Braga, Otavio, Siohan, Olivier

Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a

Externí odkaz: http://arxiv.org/abs/2205.05206

Zobrazit plný text záznamu

Report

End-to-end multi-talker audio-visual ASR using an active speaker attention module

Autor: Rose, Richard, Siohan, Olivier

This paper presents a new approach for end-to-end audio-visual multi-talker speech recognition. The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decode

Externí odkaz: http://arxiv.org/abs/2204.00652

Zobrazit plný text záznamu

Report

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

Autor: Serdyuk, Dmitriy, Braga, Otavio, Siohan, Olivier

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment th

Externí odkaz: http://arxiv.org/abs/2201.10439

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání