Showing 1 - 10 of 24 for search: '"Abdelaziz, Ahmed Hussen"'
Author:
Aldeneh, Zakaria, Higuchi, Takuya, Jung, Jee-weon, Chen, Li-Wei, Shum, Stephen, Abdelaziz, Ahmed Hussen, Watanabe, Shinji, Likhomanenko, Tatiana, Theobald, Barry-John
Iterative self-training, or iterative pseudo-labeling (IPL)--using an improved model from the current iteration to provide pseudo-labels for the next iteration--has proven to be a powerful approach to enhance the quality of speaker representations. …
External link:
http://arxiv.org/abs/2409.10791
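The iterative pseudo-labeling loop described in the abstract above can be sketched as follows. This is a minimal toy illustration of the general IPL idea only, not the paper's speaker-representation setup: the "model" is just a running mean and the "data" are numbers, both hypothetical stand-ins.

```python
# Toy sketch of iterative pseudo-labeling (IPL): each round, the current
# model labels the unlabeled pool, and the next model is trained on the
# seed labels plus those pseudo-labels.

def train(labeled):
    """Toy 'training': the model is simply the mean of its labels."""
    return sum(y for _, y in labeled) / len(labeled)

def pseudo_label(model, unlabeled):
    """Toy 'inference': label every example with the model's value."""
    return [(x, model) for x in unlabeled]

def iterative_pseudo_label(seed_labeled, unlabeled, rounds=3):
    model = train(seed_labeled)          # initial model from seed labels
    for _ in range(rounds):
        pl = pseudo_label(model, unlabeled)
        model = train(seed_labeled + pl)  # retrain on seed + pseudo-labels
    return model
```

In a real system the model, training step, and pseudo-label filtering are far richer; the loop structure is the only part carried over here.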
Author:
Chen, Li-Wei, Higuchi, Takuya, Bai, He, Abdelaziz, Ahmed Hussen, Rudnicky, Alexander, Watanabe, Shinji, Likhomanenko, Tatiana, Theobald, Barry-John, Aldeneh, Zakaria
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech for various downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked …
External link:
http://arxiv.org/abs/2409.10788
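The masked prediction objective mentioned in the abstract above can be sketched in miniature. This is an assumed toy rendering of the general idea, not HuBERT's actual loss: frames and targets are integers, masking is random, and the loss is a 0/1 error rate over masked positions only.

```python
import random

# Toy sketch of a masked-prediction objective: random frames are masked,
# and the model is scored only on its predictions at the masked positions.

def mask_frames(frames, mask_prob=0.5, rng=None):
    """Replace randomly chosen frames with None; return masked sequence
    and the list of masked positions."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    masked, positions = [], []
    for i, f in enumerate(frames):
        if rng.random() < mask_prob:
            masked.append(None)
            positions.append(i)
        else:
            masked.append(f)
    return masked, positions

def masked_prediction_loss(predictions, targets, positions):
    """0/1 loss computed over masked positions only."""
    errors = sum(predictions[i] != targets[i] for i in positions)
    return errors / max(len(positions), 1)
```

Real systems predict discrete cluster targets with a cross-entropy loss over learned representations; only the mask-then-predict structure is illustrated here.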
Author:
Palaskar, Shruti, Rudovic, Oggi, Dharur, Sameer, Pesce, Florian, Krishna, Gautam, Sivaraman, Aswin, Berkowitz, Jack, Abdelaziz, Ahmed Hussen, Adya, Saurabh, Tewfik, Ahmed
Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal …
External link:
http://arxiv.org/abs/2406.09617
Author:
Kumar, Satyam, Buddi, Sai Srujana, Sarawgi, Utkarsh Oggy, Garg, Vineet, Ranjan, Shivesh, Rudovic, Ognjen, Abdelaziz, Ahmed Hussen, Adya, Saurabh
Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need …
External link:
http://arxiv.org/abs/2406.09443
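The VAD task described in the abstract above reduces, in its simplest classical form, to thresholding short-term frame energy. The sketch below is an assumed baseline to illustrate the task's input/output shape; the paper concerns learned, personalized VAD models, not this heuristic.

```python
# Toy energy-threshold VAD: a frame is flagged as speech when its
# short-term energy exceeds a fixed threshold.

def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def energy_vad(frames, threshold=0.01):
    """Return one speech/non-speech boolean per frame."""
    return [frame_energy(f) > threshold for f in frames]
```

A learned VAD replaces the energy feature and fixed threshold with a trained classifier, but the per-frame binary decision interface is the same.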
Author:
Aldeneh, Zakaria, Higuchi, Takuya, Jung, Jee-weon, Seto, Skyler, Likhomanenko, Tatiana, Shum, Stephen, Abdelaziz, Ahmed Hussen, Watanabe, Shinji, Theobald, Barry-John
Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features …
External link:
http://arxiv.org/abs/2402.00340
Author:
Jung, Jee-weon, Zhang, Wangyou, Shi, Jiatong, Aldeneh, Zakaria, Higuchi, Takuya, Theobald, Barry-John, Abdelaziz, Ahmed Hussen, Watanabe, Shinji
This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. …
External link:
http://arxiv.org/abs/2401.17230
Author:
Krishna, Gautam, Dharur, Sameer, Rudovic, Oggi, Dighe, Pranay, Adya, Saurabh, Abdelaziz, Ahmed Hussen, Tewfik, Ahmed H
Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant versus side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g., acoustic, text …
External link:
http://arxiv.org/abs/2310.15261
Author:
Abdelaziz, Ahmed Hussen, Kumar, Anushree Prasanna, Seivwright, Chloe, Fanelli, Gabriele, Binder, Justin, Stylianou, Yannis, Kajarekar, Sachin
Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first …
External link:
http://arxiv.org/abs/2008.00620
Author:
Abdelaziz, Ahmed Hussen, Theobald, Barry-John, Dixon, Paul, Knothe, Reinhard, Apostoloff, Nicholas, Kajarekar, Sachin
We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated …
External link:
http://arxiv.org/abs/2005.13616
Author:
Aldeneh, Zakaria, Kumar, Anushree Prasanna, Theobald, Barry-John, Marchi, Erik, Kajarekar, Sachin, Naik, Devang, Abdelaziz, Ahmed Hussen
We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual …
External link:
http://arxiv.org/abs/2004.12031