Search results - "Arora, Siddhant"

Report

Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens

Authors: Kashiwagi, Yosuke; Futami, Hayato; Tsunoo, Emiru; Arora, Siddhant; Watanabe, Shinji

In many real-world scenarios, such as meetings, multiple speakers are present with an unknown number of participants, and their utterances often overlap. We address these multi-speaker challenges by a novel attention-based encoder-decoder method augm

External link: http://arxiv.org/abs/2409.15732

Report

Task Arithmetic for Language Expansion in Speech Translation

Authors: Cheng, Yao-Fei; Futami, Hayato; Kashiwagi, Yosuke; Tsunoo, Emiru; Teo, Wen Shen; Arora, Siddhant; Watanabe, Shinji

Recent advances in large language models (LLMs) have gained interest in speech-text multimodal foundation models, achieving strong performance on instruction-based speech translation (ST). However, expanding language pairs from an existing instructio

External link: http://arxiv.org/abs/2409.11274

Report

ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

Authors: Someki, Masao; Choi, Kwanghee; Arora, Siddhant; Chen, William; Cornell, Samuele; Han, Jionghao; Peng, Yifan; Shi, Jiatong; Srivastav, Vaibhav; Watanabe, Shinji

We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on va

External link: http://arxiv.org/abs/2409.09506

Report

Decoder-only Architecture for Streaming End-to-end Speech Recognition

Authors: Tsunoo, Emiru; Futami, Hayato; Kashiwagi, Yosuke; Arora, Siddhant; Watanabe, Shinji

Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for st

External link: http://arxiv.org/abs/2406.16107

Report

Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model

Authors: Futami, Hayato; Arora, Siddhant; Kashiwagi, Yosuke; Tsunoo, Emiru; Watanabe, Shinji

Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. Also, they often encounter difficulties in adapting t

External link: http://arxiv.org/abs/2406.12317

Report

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Authors: Kashiwagi, Yosuke; Futami, Hayato; Tsunoo, Emiru; Arora, Siddhant; Watanabe, Shinji

End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language i

External link: http://arxiv.org/abs/2406.12611

Report

On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Authors: Arora, Siddhant; Pasad, Ankita; Chien, Chung-Ming; Han, Jionghao; Sharma, Roshan; Jung, Jee-weon; Dhamyal, Hira; Chen, William; Shon, Suwon; Lee, Hung-yi; Livescu, Karen; Watanabe, Shinji

The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and seque

External link: http://arxiv.org/abs/2406.10083

Report

TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

Authors: Kim, Minsu; Jung, Jee-weon; Rha, Hyeongseop; Maiti, Soumi; Arora, Siddhant; Chang, Xuankai; Watanabe, Shinji; Ro, Yong Man

The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a nove

External link: http://arxiv.org/abs/2402.16021

Report

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Authors: Peng, Yifan; Tian, Jinchuan; Chen, William; Arora, Siddhant; Yan, Brian; Sudo, Yui; Shakeel, Muhammad; Choi, Kwanghee; Shi, Jiatong; Chang, Xuankai; Jung, Jee-weon; Watanabe, Shinji

Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of

External link: http://arxiv.org/abs/2401.16658

Report

Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

Authors: Futami, Hayato; Tsunoo, Emiru; Kashiwagi, Yosuke; Ogawa, Hiroaki; Arora, Siddhant; Watanabe, Shinji

In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose, which efficiently biases such words with a prefix tree

External link: http://arxiv.org/abs/2312.09582
