Showing 41 - 50 of 2,053 for search: '"Li, Jinyu"'
Traditional automatic speech recognition~(ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending longer transcription …
External link:
http://arxiv.org/abs/2211.09412
Author:
Huang, Zili, Chen, Zhuo, Kanda, Naoyuki, Wu, Jian, Wang, Yiming, Li, Jinyu, Yoshioka, Takuya, Wang, Xiaofei, Wang, Peidong
Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker …
External link:
http://arxiv.org/abs/2211.05564
Author:
Chen, Zhuo, Kanda, Naoyuki, Wu, Jian, Wu, Yu, Wang, Xiaofei, Yoshioka, Takuya, Li, Jinyu, Sivasankaran, Sunit, Eskimez, Sefik Emre
Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of the SSL-based SS by massively scaling up both the …
External link:
http://arxiv.org/abs/2211.05172
Author:
Gaur, Yashesh, Kibre, Nick, Xue, Jian, Shu, Kangyuan, Wang, Yuhui, Alphanso, Issac, Li, Jinyu, Gong, Yifan
Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State …
External link:
http://arxiv.org/abs/2211.03721
Author:
Wang, Peidong, Sun, Eric, Xue, Jian, Wu, Yu, Zhou, Long, Gaur, Yashesh, Liu, Shujie, Li, Jinyu
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. It is thus possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST model …
External link:
http://arxiv.org/abs/2211.02809
In this paper, we introduce our work of building a Streaming Multilingual Speech Model (SM2), which can transcribe or translate multiple spoken languages into texts of the target language. The backbone of SM2 is Transformer Transducer, which has high …
External link:
http://arxiv.org/abs/2211.02499
Author:
Wei, Kun, Zhou, Long, Zhang, Ziqiang, Chen, Liping, Liu, Shujie, He, Lei, Li, Jinyu, Wei, Furu
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of …
External link:
http://arxiv.org/abs/2210.17027
Author:
Yang, Muqiao, Kanda, Naoyuki, Wang, Xiaofei, Wu, Jian, Sivasankaran, Sunit, Chen, Zhuo, Li, Jinyu, Yoshioka, Takuya
Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human …
External link:
http://arxiv.org/abs/2210.15715
Masked language model (MLM) has been widely used for understanding tasks, e.g., BERT. Recently, MLM has also been used for generation tasks. The most popular one in speech is using Mask-CTC for non-autoregressive speech recognition. In this paper, we …
External link:
http://arxiv.org/abs/2210.08665
In this work, we present a simple but effective method, CTCBERT, for advancing hidden-unit BERT (HuBERT). HuBERT applies a frame-level cross-entropy (CE) loss, which is similar to most acoustic model training. However, CTCBERT performs the model …
External link:
http://arxiv.org/abs/2210.08603