Showing 1 - 10 of 97,141 results for the search: '"HUBERT, P."'
Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional …
External link: http://arxiv.org/abs/2411.02964
Authors: Komatsu, Ryota; Shinozaki, Takahiro
Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio. Recent advances highlight the potential of deriving discrete symbols from the features correlated with linguistic units, …
External link: http://arxiv.org/abs/2409.10103
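A common recipe for deriving such discrete symbols (not necessarily the one used in this paper) is to extract frame-level features from a pretrained encoder and quantize them with k-means, so that each frame maps to a cluster id acting as a pseudo-linguistic unit. The sketch below assumes torchaudio's pretrained HUBERT_BASE bundle and scikit-learn; the audio path, layer index, and cluster count are placeholders.

```python
# Hedged sketch: quantize HuBERT frame features into discrete symbols with
# k-means. This illustrates the generic SSL-unit recipe, not this paper's
# exact method; model, layer, and cluster count are assumptions.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def frame_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    """Return (frames, dim) features from one intermediate HuBERT layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        feats, _ = model.extract_features(wav, num_layers=layer)
    return feats[-1].squeeze(0)                      # last requested layer

feats = frame_features("example.wav").numpy()        # placeholder audio file
km = MiniBatchKMeans(n_clusters=100, batch_size=1024).fit(feats)
units = km.predict(feats)                            # one discrete symbol per frame
print(units[:20])
```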
Singing voice conversion (SVC) is hindered by noise sensitivity due to the use of non-robust methods for extracting pitch and energy during inference. As clean signals are key for the source audio in SVC, music source separation preprocessing off…
External link: http://arxiv.org/abs/2409.06237
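The noise-sensitivity point can be illustrated with a small preprocessing sketch: instead of running pitch and energy extractors on the raw mixture, first isolate a cleaner component. Below, librosa's harmonic/percussive separation merely stands in for a real music source separation model, and pyin plus RMS provide F0 and frame energy; the path and parameter choices are illustrative.

```python
# Hedged sketch of the preprocessing idea: clean the source signal before
# extracting pitch (F0) and frame energy. HPSS is only a lightweight stand-in
# for a proper vocals/accompaniment separator.
import numpy as np
import librosa

def pitch_and_energy(path: str, sr: int = 16000):
    y, _ = librosa.load(path, sr=sr)
    harmonic, _ = librosa.effects.hpss(y)            # crude "cleaner" component
    f0, voiced_flag, _ = librosa.pyin(
        harmonic,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )
    energy = librosa.feature.rms(y=harmonic)[0]      # frame-level RMS energy
    return np.nan_to_num(f0), voiced_flag, energy

f0, voiced, energy = pitch_and_energy("mixture.wav") # placeholder path
print(f0.shape, energy.shape)
```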
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5…
External link: http://arxiv.org/abs/2406.06371
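The faiss-based clustering mentioned here refers to HuBERT's pseudo-label step: frame features are clustered, and each frame's nearest centroid becomes its training target for the next iteration. A minimal sketch with faiss k-means follows; the feature dimensionality, cluster count, and random data are placeholders, not mHuBERT-147's actual configuration.

```python
# Hedged sketch: faiss k-means over frame features, producing per-frame
# cluster assignments usable as HuBERT pseudo-labels. Sizes are illustrative.
import numpy as np
import faiss

def cluster_frames(features: np.ndarray, k: int = 500, niter: int = 20):
    """features: (n_frames, dim) array, e.g. MFCCs or HuBERT layer outputs."""
    features = np.ascontiguousarray(features, dtype=np.float32)
    kmeans = faiss.Kmeans(d=features.shape[1], k=k, niter=niter, verbose=True)
    kmeans.train(features)
    _, assignments = kmeans.index.search(features, 1)   # nearest centroid per frame
    return kmeans.centroids, assignments.ravel()

# Toy usage with random data standing in for pooled frame features.
rng = np.random.default_rng(0)
centroids, labels = cluster_frames(rng.standard_normal((10_000, 39)), k=100)
print(centroids.shape, labels[:10])
```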
In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR). However, H…
External link: http://arxiv.org/abs/2406.05661
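For readers unfamiliar with how HuBERT is used for ASR, a short sketch follows: a CTC-fine-tuned checkpoint is loaded through Hugging Face transformers and greedy-decoded. The checkpoint name is a public fine-tuned model chosen for illustration; the audio path is a placeholder.

```python
# Hedged sketch: transcription with a CTC-fine-tuned HuBERT via transformers.
import torch
import torchaudio
from transformers import HubertForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft").eval()

wav, sr = torchaudio.load("example.wav")                  # placeholder path
wav = torchaudio.functional.resample(wav, sr, 16_000)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.inference_mode():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])                     # greedy CTC decode
```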
Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we …
External link: http://arxiv.org/abs/2406.10275
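The usual way such pre-trained representations are leveraged for SER is as a frozen encoder with a small classification head over pooled frame features. The sketch below assumes torchaudio's WAV2VEC2_BASE bundle and a four-class label set purely for illustration; it is not this paper's architecture.

```python
# Hedged sketch: frozen SSL encoder + mean pooling + linear head for SER.
import torch
import torch.nn as nn
import torchaudio

EMOTIONS = ["neutral", "happy", "sad", "angry"]      # illustrative label set

bundle = torchaudio.pipelines.WAV2VEC2_BASE          # any SSL speech encoder works here
encoder = bundle.get_model().eval()
for p in encoder.parameters():
    p.requires_grad_(False)                          # keep the foundation model frozen

head = nn.Linear(768, len(EMOTIONS))                 # the only trainable part

def emotion_logits(wav: torch.Tensor, sr: int) -> torch.Tensor:
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.no_grad():
        feats, _ = encoder.extract_features(wav)
    pooled = feats[-1].mean(dim=1)                   # (batch, 768) utterance embedding
    return head(pooled)

wav, sr = torchaudio.load("clip.wav")                # placeholder path
print(EMOTIONS[emotion_logits(wav, sr)[0].argmax().item()])
```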
Academic article
This result cannot be displayed to users who are not logged in; log in to view it.
Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to effectively utilize audio-visual synchronization information in the process.
External link: http://arxiv.org/abs/2403.16078
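The synchronization question amounts to aligning visual cues (typically lip embeddings at video frame rate) with audio features at a much higher frame rate before estimating a mask for the target speaker. The PyTorch module below shows this generic fusion pattern with assumed dimensions and rates; it is not this paper's architecture.

```python
# Hedged sketch of generic AV-TSE fusion: upsample lip embeddings to the audio
# frame rate, concatenate with mixture features, and predict a soft mask.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFusionMasker(nn.Module):
    def __init__(self, audio_dim=256, video_dim=512, hidden=256):
        super().__init__()
        self.proj_v = nn.Linear(video_dim, hidden)
        self.rnn = nn.GRU(audio_dim + hidden, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, audio_dim)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, Ta, audio_dim); video_feats: (B, Tv, video_dim), Tv << Ta
        v = self.proj_v(video_feats).transpose(1, 2)              # (B, hidden, Tv)
        v = F.interpolate(v, size=audio_feats.size(1), mode="linear", align_corners=False)
        fused = torch.cat([audio_feats, v.transpose(1, 2)], dim=-1)
        h, _ = self.rnn(fused)
        return torch.sigmoid(self.mask(h))                        # mask over mixture features

mask = AVFusionMasker()(torch.randn(2, 400, 256), torch.randn(2, 100, 512))
print(mask.shape)   # torch.Size([2, 400, 256])
```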
Human language can be expressed in either written or spoken form, i.e., text or speech. Humans can acquire knowledge from text to improve speaking and listening. However, the quest for speech pre-trained models to leverage unpaired text has just started …
External link: http://arxiv.org/abs/2402.15725
Authors: Cho, Cheol Jun; Mohamed, Abdelrahman; Li, Shang-Wen; Black, Alan W.; Anumanchipalli, Gopala K.
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space and the units beyond phonemes are largely underexplored. Here, we …
External link: http://arxiv.org/abs/2310.10803