Showing 1 - 10 of 64 for search: '"Arora, Siddhant"'
In many real-world scenarios, such as meetings, multiple speakers are present with an unknown number of participants, and their utterances often overlap. We address these multi-speaker challenges by a novel attention-based encoder-decoder method augm
External link:
http://arxiv.org/abs/2409.15732
Authors:
Cheng, Yao-Fei, Futami, Hayato, Kashiwagi, Yosuke, Tsunoo, Emiru, Teo, Wen Shen, Arora, Siddhant, Watanabe, Shinji
Recent advances in large language models (LLMs) have gained interest in speech-text multimodal foundation models, achieving strong performance on instruction-based speech translation (ST). However, expanding language pairs from an existing instructio
External link:
http://arxiv.org/abs/2409.11274
Authors:
Someki, Masao, Choi, Kwanghee, Arora, Siddhant, Chen, William, Cornell, Samuele, Han, Jionghao, Peng, Yifan, Shi, Jiatong, Srivastav, Vaibhav, Watanabe, Shinji
We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on va
External link:
http://arxiv.org/abs/2409.09506
Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for st
External link:
http://arxiv.org/abs/2406.16107
Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. Also, they often encounter difficulties in adapting t
External link:
http://arxiv.org/abs/2406.12317
End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language i
External link:
http://arxiv.org/abs/2406.12611
Authors:
Arora, Siddhant, Pasad, Ankita, Chien, Chung-Ming, Han, Jionghao, Sharma, Roshan, Jung, Jee-weon, Dhamyal, Hira, Chen, William, Shon, Suwon, Lee, Hung-yi, Livescu, Karen, Watanabe, Shinji
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and seque
External link:
http://arxiv.org/abs/2406.10083
Authors:
Kim, Minsu, Jung, Jee-weon, Rha, Hyeongseop, Maiti, Soumi, Arora, Siddhant, Chang, Xuankai, Watanabe, Shinji, Ro, Yong Man
The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a nove
External link:
http://arxiv.org/abs/2402.16021
Authors:
Peng, Yifan, Tian, Jinchuan, Chen, William, Arora, Siddhant, Yan, Brian, Sudo, Yui, Shakeel, Muhammad, Choi, Kwanghee, Shi, Jiatong, Chang, Xuankai, Jung, Jee-weon, Watanabe, Shinji
Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of
External link:
http://arxiv.org/abs/2401.16658
Authors:
Futami, Hayato, Tsunoo, Emiru, Kashiwagi, Yosuke, Ogawa, Hiroaki, Arora, Siddhant, Watanabe, Shinji
In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose, which efficiently biases such words with a prefix tree
External link:
http://arxiv.org/abs/2312.09582