Search results

Report

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Authors: Shakeel, Muhammad; Sudo, Yui; Peng, Yifan; Watanabe, Shinji

Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contex

External link: http://arxiv.org/abs/2406.16120

Report

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Authors: Sudo, Yui; Shakeel, Muhammad; Fukumoto, Yosuke; Yan, Brian; Shi, Jiatong; Peng, Yifan; Watanabe, Shinji

End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predic

External link: http://arxiv.org/abs/2406.02950

Report

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Authors: Shakeel, Muhammad; Sudo, Yui; Peng, Yifan; Watanabe, Shinji

End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real time as they are being received, while non-streaming ASR waits for

External link: http://arxiv.org/abs/2405.13514

Report

Contextualized Automatic Speech Recognition with Dynamic Vocabulary

Authors: Sudo, Yui; Fukumoto, Yosuke; Shakeel, Muhammad; Peng, Yifan; Watanabe, Shinji

Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined

External link: http://arxiv.org/abs/2405.13344

Report

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Authors: Peng, Yifan; Sudo, Yui; Shakeel, Muhammad; Watanabe, Shinji

There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. Ho

External link: http://arxiv.org/abs/2402.12654

Report

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Authors: Peng, Yifan; Tian, Jinchuan; Chen, William; Arora, Siddhant; Yan, Brian; Sudo, Yui; Shakeel, Muhammad; Choi, Kwanghee; Shi, Jiatong; Chang, Xuankai; Jung, Jee-weon; Watanabe, Shinji

Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of

External link: http://arxiv.org/abs/2401.16658

Report

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Authors: Sudo, Yui; Shakeel, Muhammad; Fukumoto, Yosuke; Peng, Yifan; Watanabe, Shinji

End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired

External link: http://arxiv.org/abs/2401.10449

Report

Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

Authors: Peng, Yifan; Tian, Jinchuan; Yan, Brian; Berrebbi, Dan; Chang, Xuankai; Li, Xinjian; Shi, Jiatong; Arora, Siddhant; Chen, William; Sharma, Roshan; Zhang, Wangyou; Sudo, Yui; Shakeel, Muhammad; Jung, Jee-weon; Maiti, Soumi; Watanabe, Shinji

Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation b

External link: http://arxiv.org/abs/2309.13876

Report

Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation

Authors: Sudo, Yui; Hata, Kazuya; Nakadai, Kazuhiro

End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but a specific issue that needs to be addressed is the difficulty it has in handling enharmonic words: named entities (NEs) with the same pronunciation and pa

External link: http://arxiv.org/abs/2305.17846

Report

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

Authors: Peng, Yifan; Sudo, Yui; Muhammad, Shakeel; Watanabe, Shinji

Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but the large model size and heavy computational cost hinder the deployment. Knowledge distillation trains a small student model to mimic the behavior of a l

External link: http://arxiv.org/abs/2305.17651
