Výsledky vyhledávání

Report

Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

Autor: Lu, Ke-Han, Chen, Zhehuai, Fu, Szu-Wei, Yang, Chao-Han Huck, Balam, Jagadeesh, Ginsburg, Boris, Wang, Yu-Chiang Frank, Lee, Hung-yi

Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap be

Externí odkaz: http://arxiv.org/abs/2409.20007

Zobrazit plný text záznamu

Report

EMMeTT: Efficient Multimodal Machine Translation Training

Autor: Żelasko, Piotr, Chen, Zhehuai, Wang, Mengru, Galvez, Daniel, Hrinchuk, Oleksii, Ding, Shuoyang, Hu, Ke, Balam, Jagadeesh, Lavrukhin, Vitaly, Ginsburg, Boris

A rising interest in the modality extension of foundation language models warrants discussion on the most effective, and efficient, multimodal training approach. This work focuses on neural machine translation (NMT) and proposes a joint multimodal tr

Externí odkaz: http://arxiv.org/abs/2409.13523

Zobrazit plný text záznamu

Report

META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR

Autor: Wang, Jinhan, Wang, Weiqing, Dhawan, Kunal, Park, Taejin, Kim, Myungjong, Medennikov, Ivan, Huang, He, Koluguri, Nithin, Balam, Jagadeesh, Ginsburg, Boris

We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker supervisio

Externí odkaz: http://arxiv.org/abs/2409.12352

Zobrazit plný text záznamu

Report

Chain-of-Thought Prompting for Speech Translation

Autor: Hu, Ke, Chen, Zhehuai, Yang, Chao-Han Huck, Żelasko, Piotr, Hrinchuk, Oleksii, Lavrukhin, Vitaly, Balam, Jagadeesh, Ginsburg, Boris

Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in

Externí odkaz: http://arxiv.org/abs/2409.11538

Zobrazit plný text záznamu

Report

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new c

Externí odkaz: http://arxiv.org/abs/2409.09785

Zobrazit plný text záznamu

Report

Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

Autor: Park, Taejin, Medennikov, Ivan, Dhawan, Kunal, Wang, Weiqing, Huang, He, Koluguri, Nithin Rao, Puvvada, Krishna C., Balam, Jagadeesh, Ginsburg, Boris

We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challe

Externí odkaz: http://arxiv.org/abs/2409.06656

Zobrazit plný text záznamu

Report

Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation

Autor: Koluguri, Nithin Rao, Bartley, Travis, Xu, Hainan, Hrinchuk, Oleksii, Balam, Jagadeesh, Ginsburg, Boris, Kucsko, Georg

This paper presents a new method for training sequence-to-sequence models for speech recognition and translation tasks. Instead of the traditional approach of training models on short segments containing only lowercase or partial punctuation and capi

Externí odkaz: http://arxiv.org/abs/2409.05601

Zobrazit plný text záznamu

Report

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

Autor: Wang, Weiqing, Dhawan, Kunal, Park, Taejin, Puvvada, Krishna C., Medennikov, Ivan, Majumdar, Somshubra, Huang, He, Balam, Jagadeesh, Ginsburg, Boris

Speech foundation models have achieved state-of-the-art (SoTA) performance across various tasks, such as automatic speech recognition (ASR) in hundreds of languages. However, multi-speaker ASR remains a challenging task for these models due to data s

Externí odkaz: http://arxiv.org/abs/2409.01438

Zobrazit plný text záznamu

Report

NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

Autor: Huang, He, Park, Taejin, Dhawan, Kunal, Medennikov, Ivan, Puvvada, Krishna C., Koluguri, Nithin Rao, Wang, Weiqing, Balam, Jagadeesh, Ginsburg, Boris

Self-supervised learning has been proved to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification and diarization, etc. However, most of current approaches are computationally expensive. In this

Externí odkaz: http://arxiv.org/abs/2408.13106

Zobrazit plný text záznamu

Report

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Autor: Majumdar, Somshubra, Noroozi, Vahid, Narenthiran, Sean, Ficek, Aleksander, Balam, Jagadeesh, Ginsburg, Boris

Large Language Models (LLMs) rely on instruction samples for alignment, but creating these datasets poses challenges, particularly in expert-dependent tasks like coding, which can be cost-prohibitive. One approach to mitigate these challenges is synt

Externí odkaz: http://arxiv.org/abs/2407.21077

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání