Showing 1 - 10 of 35 for the search: '"Kashiwagi, Yosuke"'
In many real-world scenarios, such as meetings, multiple speakers are present, the number of participants is unknown, and their utterances often overlap. We address these multi-speaker challenges with a novel attention-based encoder-decoder method augm…
External link:
http://arxiv.org/abs/2409.15732
Authors:
Cheng, Yao-Fei, Futami, Hayato, Kashiwagi, Yosuke, Tsunoo, Emiru, Teo, Wen Shen, Arora, Siddhant, Watanabe, Shinji
Recent advances in large language models (LLMs) have spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-based speech translation (ST). However, expanding language pairs from an existing instructio…
External link:
http://arxiv.org/abs/2409.11274
Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks, including automatic speech recognition (ASR). These LMs are amply expressive and run efficiently; this efficiency is a suitable characteristic for st…
External link:
http://arxiv.org/abs/2406.16107
Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech-processing tasks. However, these models often rely on a large number of parameters. They also often encounter difficulties in adapting t…
External link:
http://arxiv.org/abs/2406.12317
End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language i…
External link:
http://arxiv.org/abs/2406.12611
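The snippet above is cut off before it says how language identity enters the model, but a common pattern for single-model multilingual ASR is to condition the decoder on a language token and let the model predict that token when the language is unknown. The sketch below is a hedged illustration of that general pattern, not the linked paper's method; the token strings and the `build_decoder_prompt` helper are hypothetical.

```python
# Hypothetical sketch: language-token conditioning in a single
# multilingual ASR model. Token strings are illustrative, not from
# the linked paper.
LANG_TOKENS = {"en": "<|en|>", "ja": "<|ja|>", "de": "<|de|>"}

def build_decoder_prompt(lang: str | None) -> list[str]:
    """Start decoding with a language token when the language is known;
    otherwise leave the choice to the model (language identification)."""
    prompt = ["<sos>"]
    if lang is not None:
        prompt.append(LANG_TOKENS[lang])  # known language: force the token
    # Unknown language: the first decoding step chooses among LANG_TOKENS,
    # i.e., the model itself performs language identification.
    return prompt

print(build_decoder_prompt("ja"))  # ['<sos>', '<|ja|>']
print(build_decoder_prompt(None))  # ['<sos>'] -> model predicts the language
```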
Authors:
Futami, Hayato, Tsunoo, Emiru, Kashiwagi, Yosuke, Ogawa, Hiroaki, Arora, Siddhant, Watanabe, Shinji
In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. The Tree-constrained Pointer Generator (TCPGen), which efficiently biases such words with a prefix tree, has shown promise for this purpose…
External link:
http://arxiv.org/abs/2312.09582
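TCPGen's biasing rests on a prefix tree (trie) over the rare words: at each decoding step, only tokens that continue some biasing word can receive extra probability mass. The sketch below shows just that tree lookup, under an illustrative subword tokenization; the neural pointer-generator distribution of the actual method is omitted.

```python
# Minimal sketch of the prefix tree (trie) underlying TCPGen-style
# contextual biasing: given the tokens decoded so far within a word,
# return the set of valid next tokens among the biasing words.

def build_trie(words: list[list[str]]) -> dict:
    root: dict = {}
    for tokens in words:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<eow>"] = {}  # mark the end of a biasing word
    return root

def valid_next_tokens(trie: dict, prefix: list[str]) -> set[str]:
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()  # prefix left the tree: no biasing applies
        node = node[tok]
    return set(node)

# Biasing words pre-tokenized into subwords (illustrative tokenization).
trie = build_trie([["Kash", "iwa", "gi"], ["Tsun", "oo"]])
print(valid_next_tokens(trie, ["Kash"]))         # {'iwa'}
print(valid_next_tokens(trie, ["Kash", "iwa"]))  # {'gi'}
```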
Authors:
Arora, Siddhant, Futami, Hayato, Jung, Jee-weon, Peng, Yifan, Sharma, Roshan, Kashiwagi, Yosuke, Tsunoo, Emiru, Livescu, Karen, Watanabe, Shinji
Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing the performance of task-specific models. Motivated by this, we ask: can we build a single model tha…
External link:
http://arxiv.org/abs/2310.02973
Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use…
External link:
http://arxiv.org/abs/2309.08876
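Shallow fusion, which the snippet contrasts against, needs no architecture changes: during beam search the ASR score is simply interpolated with an external LM score. A minimal sketch, with made-up probabilities and an arbitrary weight of 0.3:

```python
# Sketch of shallow fusion: at each decoding step, interpolate the ASR
# token score with an external LM score. The probabilities and weight
# below are illustrative stand-ins for real model outputs.
import math

def shallow_fusion_score(asr_logprob: float, lm_logprob: float,
                         lm_weight: float = 0.3) -> float:
    """Combined token score used to rank beam-search hypotheses."""
    return asr_logprob + lm_weight * lm_logprob

# Example: two candidate tokens at one decoding step.
candidates = {
    "nights": (math.log(0.40), math.log(0.05)),   # (ASR, LM) probabilities
    "knights": (math.log(0.35), math.log(0.30)),
}
best = max(candidates, key=lambda t: shallow_fusion_score(*candidates[t]))
print(best)  # 'knights': the LM rescues the acoustically weaker token
```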
Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning. Conversely, label-based attention encoder-decoder m…
External link:
http://arxiv.org/abs/2307.12767
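The frame/label distinction above is usually bridged by interpolating the two scores during beam search, as in standard joint CTC/attention decoding. The sketch below illustrates only that general interpolation idea, not the specific streaming algorithm of the linked paper.

```python
# Sketch of joint scoring across frame-based (e.g., CTC) and label-based
# (attention decoder) views of a hypothesis during beam search.

def joint_score(ctc_prefix_logprob: float, att_logprob: float,
                ctc_weight: float = 0.4) -> float:
    """Mix frame-synchronous and label-synchronous hypothesis scores.

    The CTC prefix score sums over all frame alignments of the prefix,
    penalizing hypotheses no alignment supports (which helps pruning),
    while the attention score captures label-level context.
    """
    return ctc_weight * ctc_prefix_logprob + (1.0 - ctc_weight) * att_logprob

print(joint_score(-3.2, -1.1))  # -1.94
```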
Authors:
Arora, Siddhant, Futami, Hayato, Kashiwagi, Yosuke, Tsunoo, Emiru, Yan, Brian, Watanabe, Shinji
There has been increased interest in integrating pretrained speech recognition (ASR) and language models (LMs) into the SLU framework. However, prior methods often struggle with a vocabulary mismatch between the pretrained models, and the LM cannot…
External link:
http://arxiv.org/abs/2307.11005