Showing 1 - 10 of 37 for search: '"Huh, Jaesung"'
Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities…
External link:
http://arxiv.org/abs/2404.05559
The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character…
External link:
http://arxiv.org/abs/2401.12039
This report presents the technical details of our submission to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023 from the OxfordVGG team. We present WhisperX, a system for efficient speech transcription of long-form audio with…
External link:
http://arxiv.org/abs/2307.09006
Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window…
External link:
http://arxiv.org/abs/2303.00747
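As an illustration of the long-form transcription workflow described above, here is a minimal sketch following the usage shown in the WhisperX repository (https://github.com/m-bain/whisperX). The function names follow its README but may differ across versions, and the audio file path is a placeholder.

```python
# Minimal WhisperX sketch: batched transcription of long-form audio,
# then forced alignment for accurate word-level timestamps.
# API follows the WhisperX README; details may vary by version.
import whisperx

device = "cuda"                        # or "cpu"
audio_file = "long_recording.wav"      # placeholder path

# 1. Transcribe with the batched Whisper backend. VAD-based chunking
#    avoids the drift of naive buffered/sliding-window decoding.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the transcript with a phoneme model for precise timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata,
                         audio, device)

for seg in aligned["segments"]:
    print(f'[{seg["start"]:7.2f} - {seg["end"]:7.2f}] {seg["text"]}')
```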
Author:
Huh, Jaesung; Brown, Andrew; Jung, Jee-weon; Chung, Joon Son; Nagrani, Arsha; Garcia-Romero, Daniel; Zisserman, Andrew
This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems…
External link:
http://arxiv.org/abs/2302.10248
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio…
External link:
http://arxiv.org/abs/2302.00646
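To make the shape of such temporal audio annotations concrete, the snippet below reads a hypothetical annotation file with pandas and sums the labelled duration per class. The file name and column names are illustrative assumptions, not the actual EPIC-SOUNDS schema.

```python
# Hypothetical sketch: load temporal audio annotations and report the
# total labelled duration per class. Column names are assumed here and
# do not necessarily match the released EPIC-SOUNDS files.
import pandas as pd

ann = pd.read_csv("audio_annotations.csv")   # placeholder file name

# Assumed columns: 'class', 'start_sec', 'stop_sec'
ann["duration_sec"] = ann["stop_sec"] - ann["start_sec"]
per_class = ann.groupby("class")["duration_sec"].sum()
print(per_class.sort_values(ascending=False).head(10))
```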
The goal of this paper is to learn robust speaker representations for a bilingual speaking scenario. The majority of the world's population speak at least two languages; however, most speaker recognition systems fail to recognise the same speaker when…
External link:
http://arxiv.org/abs/2211.00437
Author:
Jung, Jee-weon; Heo, Hee-Soo; Lee, Bong-Jin; Huh, Jaesung; Brown, Andrew; Kwon, Youngki; Watanabe, Shinji; Chung, Joon Son
Speaker embedding extractors (EEs), which map input audio to a speaker discriminant latent space, are of paramount importance in speaker diarisation. However, there are several challenges when adopting EEs for diarisation, from which we tackle two key…
External link:
http://arxiv.org/abs/2210.14682
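For readers unfamiliar with the embedding-and-compare pattern that underlies diarisation, the sketch below scores two utterances with a pretrained ECAPA-TDNN extractor from SpeechBrain. This is a generic illustration, not the method proposed in the paper; the checkpoint name follows SpeechBrain's published models, and the audio paths are placeholders.

```python
# Generic speaker-embedding sketch (not this paper's method): map two
# utterances into a speaker-discriminant latent space and compare them
# with cosine similarity, the basic operation behind diarisation.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Published SpeechBrain checkpoint; expects 16 kHz mono input.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

def embed(path: str) -> torch.Tensor:
    signal, sample_rate = torchaudio.load(path)
    return encoder.encode_batch(signal).squeeze()

e1 = embed("utterance_a.wav")   # placeholder paths
e2 = embed("utterance_b.wav")
score = torch.nn.functional.cosine_similarity(e1, e2, dim=0)
print(f"same-speaker similarity: {score.item():.3f}")
```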
Author:
Brown, Andrew; Huh, Jaesung; Chung, Joon Son; Nagrani, Arsha; Garcia-Romero, Daniel; Zisserman, Andrew
The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained…
External link:
http://arxiv.org/abs/2201.04583
In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context…
External link:
http://arxiv.org/abs/2111.01024