Showing 1 - 10 of 86 for search: '"Kanda, Naoyuki"'
Author:
Eskimez, Sefik Emre, Wang, Xiaofei, Thakker, Manthan, Li, Canrun, Tsai, Chung-Hsien, Xiao, Zhen, Yang, Hemin, Zhu, Zirun, Tang, Min, Tan, Xu, Liu, Yanqing, Zhao, Sheng, Kanda, Naoyuki
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, …
External link:
http://arxiv.org/abs/2406.18009
Author:
Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Yang, Hemin, Zhu, Zirun, Tang, Min, Xia, Yufei, Li, Jinzhu, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki
Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt …
External link:
http://arxiv.org/abs/2406.05699
Author:
Eskimez, Sefik Emre, Wang, Xiaofei, Thakker, Manthan, Tsai, Chung-Hsien, Li, Canrun, Xiao, Zhen, Yang, Hemin, Zhu, Zirun, Tang, Min, Li, Jinyu, Zhao, Sheng, Kanda, Naoyuki
Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker …
External link:
http://arxiv.org/abs/2406.04281
Author:
Kanda, Naoyuki, Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Yang, Hemin, Zhu, Zirun, Tang, Min, Li, Canrun, Tsai, Chung-Hsien, Xiao, Zhen, Xia, Yufei, Li, Jinzhu, Liu, Yanqing, Zhao, Sheng, Zeng, Michael
Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their …
External link:
http://arxiv.org/abs/2402.07383
The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions. This has made offering translations in multiple languages essential for user applications. …
External link:
http://arxiv.org/abs/2310.14806
Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from …
External link:
http://arxiv.org/abs/2309.12521
Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a …
External link:
http://arxiv.org/abs/2309.08131
Author:
Yang, Mu, Kanda, Naoyuki, Wang, Xiaofei, Chen, Junkun, Wang, Peidong, Xue, Jian, Li, Jinyu, Yoshioka, Takuya
End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we …
External link:
http://arxiv.org/abs/2309.08007
Author:
Wang, Xiaofei, Thakker, Manthan, Chen, Zhuo, Kanda, Naoyuki, Eskimez, Sefik Emre, Chen, Sanyuan, Tang, Min, Liu, Shujie, Li, Jinyu, Yoshioka, Takuya
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech …
External link:
http://arxiv.org/abs/2308.06873
Author:
Li, Chenda, Qian, Yao, Chen, Zhuo, Kanda, Naoyuki, Wang, Dongmei, Yoshioka, Takuya, Qian, Yanmin, Zeng, Michael
State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is …
External link:
http://arxiv.org/abs/2305.18747