Výsledky vyhledávání

Simulating Realistic Speech Overlaps Improves Multi-Talker ASR

Autor: Yang, Muqiao, Kanda, Naoyuki, Wang, Xiaofei, Wu, Jian, Sivasankaran, Sunit, Chen, Zhuo, Li, Jinyu, Yoshioka, Takuya

Publikováno v: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human t

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::e45f3594b8f4b97da5abdc8d894a9b9e
https://doi.org/10.1109/icassp49357.2023.10094928

Zobrazit plný text záznamu

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation

Autor: Wang, Xiaoqiang, Liu, Yanqing, Li, Jinyu, Zhao, Sheng

Publikováno v: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

We previously proposed contextual spelling correction (CSC) to correct the output of end-to-end (E2E) automatic speech recognition (ASR) models with contextual information such as name, place, etc. Although CSC has achieved reasonable improvement in

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::08e72a8e04c73731f969cf87d4b0e4df
https://doi.org/10.1109/icassp49357.2023.10095434

Zobrazit plný text záznamu

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Autor: Wei, Kun, Zhou, Long, Zhang, Ziqiang, Chen, Liping, Liu, Shujie, He, Lei, Li, Jinyu, Wei, Furu

Publikováno v: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::02bec1e803cd87d7ac2d53d0b921a952
https://doi.org/10.1109/icassp49357.2023.10095616

Zobrazit plný text záznamu

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

Autor: Wang, Tianrui, Zhou, Long, Zhang, Ziqiang, Wu, Yu, Liu, Shujie, Gaur, Yashesh, Chen, Zhuo, Li, Jinyu, Wei, Furu

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that u

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::071e31d93750cf76e2c5f8f576a56722
http://arxiv.org/abs/2305.16107

Zobrazit plný text záznamu

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Autor: Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Zhang, Ziqiang, Zhou, Long, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a condit

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::d7122f1b2f27efd14eece69f9f82e478

Zobrazit plný text záznamu

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

Autor: Papi, Sara, Wan, Peidong, Chen, Junkun, Xue, Jian, Li, Jinyu, Gaur, Yashesh

In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transforme

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::97f88fbd214361d264329b545e34862f

Zobrazit plný text záznamu

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Autor: Zhang, Ziqiang, Zhou, Long, Wang, Chengyi, Chen, Sanyuan, Wu, Yu, Liu, Shujie, Chen, Zhuo, Liu, Yanqing, Wang, Huaming, Li, Jinyu, He, Lei, Zhao, Sheng, Wei, Furu

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target lang

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::4332efdacb44ce5ea14a0c5016fb69e1

Zobrazit plný text záznamu

LAMASSU: A Streaming Language-Agnostic Multilingual Speech Recognition and Translation Model Using Neural Transducers

Autor: Wang, Peidong, Sun, Eric, Xue, Jian, Wu, Yu, Zhou, Long, Gaur, Yashesh, Liu, Shujie, Li, Jinyu

Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. It is thus possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST model

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::baee0aee263ba5ee17798574a87daca5

Zobrazit plný text záznamu

Acoustic-aware Non-autoregressive Spell Correction with Mask Sample Decoding

Autor: Fan, Ruchao, Ye, Guoli, Gaur, Yashesh, Li, Jinyu

Masked language model (MLM) has been widely used for understanding tasks, e.g. BERT. Recently, MLM has also been used for generation tasks. The most popular one in speech is using Mask-CTC for non-autoregressive speech recognition. In this paper, we

Externí odkaz: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::14a1b9148c7a7678b5a4b8df4f8fac75

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání