Search results - "Kim, Minchan"

Report

High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model

Authors: Lee, Joun Yeop; Jeong, Myeonghun; Kim, Minchan; Lee, Ji-Hyun; Cho, Hoon-Young; Kim, Nam Soo

We propose a novel two-stage text-to-speech (TTS) framework with two types of discrete tokens, i.e., semantic and acoustic tokens, for high-fidelity speech synthesis. It features two core components: the Interpreting module, which processes text and

External link: http://arxiv.org/abs/2406.17310

Report

MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Authors: Kim, Semin; Jeong, Myeonghun; Lee, Hyeonseung; Kim, Minchan; Choi, Byoung Jin; Kim, Nam Soo

In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data

External link: http://arxiv.org/abs/2406.05965

Report

Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

Authors: Kim, Minchan; Kim, Minyeong; Bae, Junik; Choi, Suhwan; Kim, Sungkyung; Chang, Buru

Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this iss

External link: http://arxiv.org/abs/2403.16167

Report

Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Authors: Kim, Minchan; Jeong, Myeonghun; Choi, Byoung Jin; Kim, Semin; Lee, Joun Yeop; Kim, Nam Soo

We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discre

External link: http://arxiv.org/abs/2401.01498

Report

Efficient Parallel Audio Generation using Group Masked Language Modeling

Authors: Jeong, Myeonghun; Kim, Minchan; Lee, Joun Yeop; Kim, Nam Soo

We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference speed compared to autoregressive models, it still suffers from slow inf

External link: http://arxiv.org/abs/2401.01099

Report

Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

Authors: Kim, Minchan; Jeong, Myeonghun; Choi, Byoung Jin; Lee, Dongjune; Kim, Nam Soo

We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment

External link: http://arxiv.org/abs/2311.02898

Report

Pre- and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer

Authors: Kim, Minchan; Han, Junhyek; Kim, Jaehyung; Kim, Beomjoon

We present a system for non-prehensile manipulation that requires a significant number of contact mode transitions and the use of environmental contacts to successfully manipulate an object to a target location. Our method is based on deep reinforceme

External link: http://arxiv.org/abs/2309.02754

Report

EM-Network: Oracle Guided Self-distillation for Sequence Learning

Authors: Yoon, Ji Won; Ahn, Sunghwan; Lee, Hyeonseung; Kim, Minchan; Kim, Seok Min; Kim, Nam Soo

We introduce EM-Network, a novel self-distillation approach that effectively leverages target information for supervised sequence-to-sequence (seq2seq) learning. In contrast to conventional methods, it is trained with oracle guidance, which is derive

External link: http://arxiv.org/abs/2306.10058

Report

Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Authors: Choi, Byoung Jin; Jeong, Myeonghun; Kim, Minchan; Mun, Sung Hwan; Kim, Nam Soo

Several recently proposed text-to-speech (TTS) models have succeeded in generating speech samples with human-level quality in single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's

External link: http://arxiv.org/abs/2210.05979

Report

Fully Unsupervised Training of Few-shot Keyword Spotting

Authors: Lee, Dongjune; Kim, Minchan; Mun, Sung Hwan; Han, Min Hyun; Kim, Nam Soo

For training a few-shot keyword spotting (FS-KWS) model, a large labeled dataset containing massive target keywords has been known to be essential to generalize to arbitrary target keywords with only a few enrollment samples. To alleviate the expensive da

External link: http://arxiv.org/abs/2210.02732
