Search results - "Zeng, Michael"

Report

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Authors: Le, Chenyang; Qian, Yao; Wang, Dongmei; Zhou, Long; Liu, Shujie; Wang, Xiaofei; Yousefi, Midia; Qian, Yanmin; Li, Jinyu; Zhao, Sheng; Zeng, Michael

There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeli

External link: http://arxiv.org/abs/2405.17809

Report

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Authors: Zhang, Leying; Qian, Yao; Zhou, Long; Liu, Shujie; Wang, Dongmei; Wang, Xiaofei; Yousefi, Midia; Qian, Yanmin; Li, Jinyu; He, Lei; Zhao, Sheng; Zeng, Michael

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a chal

External link: http://arxiv.org/abs/2404.06690

Report

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Authors: Kanda, Naoyuki; Wang, Xiaofei; Eskimez, Sefik Emre; Thakker, Manthan; Yang, Hemin; Zhu, Zirun; Tang, Min; Li, Canrun; Tsai, Chung-Hsien; Xiao, Zhen; Xia, Yufei; Li, Jinzhu; Liu, Yanqing; Zhao, Sheng; Zeng, Michael

Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their a

External link: http://arxiv.org/abs/2402.07383

Report

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Authors: Xiao, Bin; Wu, Haiping; Xu, Weijian; Dai, Xiyang; Hu, Houdong; Lu, Yumao; Zeng, Michael; Liu, Ce; Yuan, Lu

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a

External link: http://arxiv.org/abs/2311.06242

Report

Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction

Authors: Zhang, Leying; Qian, Yao; Yu, Linfeng; Wang, Heming; Wang, Xinkai; Yang, Hemin; Zhou, Long; Liu, Shujie; Qian, Yanmin; Zeng, Michael

Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortion in ter

External link: http://arxiv.org/abs/2309.13874

Report

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Authors: Ling, Shaoshi; Hu, Yuxuan; Qian, Shuangbei; Ye, Guoli; Qian, Yao; Gong, Yifan; Lin, Ed; Zeng, Michael

Most end-to-end (E2E) speech recognition models are composed of encoder and decoder blocks that perform acoustic and language modeling functions. Pretrained large language models (LLMs) have the potential to improve the performance of E2E ASR. Howeve

External link: http://arxiv.org/abs/2307.08234

Report

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

Authors: Li, Chenda; Qian, Yao; Chen, Zhuo; Kanda, Naoyuki; Wang, Dongmei; Yoshioka, Takuya; Qian, Yanmin; Zeng, Michael

State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is

External link: http://arxiv.org/abs/2305.18747

Report

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

Authors: Le, Chenyang; Qian, Yao; Zhou, Long; Liu, Shujie; Qian, Yanmin; Zeng, Michael; Huang, Xuedong

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of pub

External link: http://arxiv.org/abs/2305.14838

Report

i-Code Studio: A Configurable and Composable Framework for Integrative AI

Authors: Fang, Yuwei; Khademi, Mahmoud; Zhu, Chenguang; Yang, Ziyi; Pryzant, Reid; Xu, Yichong; Qian, Yao; Yoshioka, Takuya; Yuan, Lu; Zeng, Michael; Huang, Xuedong

Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities. Integrative AI is one important direction to approach AGI, through combin

External link: http://arxiv.org/abs/2305.13738

Report

LMGQS: A Large-scale Dataset for Query-focused Summarization

Authors: Xu, Ruochen; Wang, Song; Liu, Yang; Wang, Shuohang; Xu, Yichong; Iter, Dan; Zhu, Chenguang; Zeng, Michael

Query-focused summarization (QFS) aims to extract or generate a summary of an input document that directly answers or is relevant to a given query. The lack of large-scale datasets in the form of documents, queries, and summaries has hindered model d

External link: http://arxiv.org/abs/2305.13086
