Search results

Report

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

Authors: Wang, Tianrui; Zhou, Long; Zhang, Ziqiang; Wu, Yu; Liu, Shujie; Gaur, Yashesh; Chen, Zhuo; Li, Jinyu; Wei, Furu

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that u

External link: http://arxiv.org/abs/2305.16107

Report

PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds

Authors: Li, Jinyu; Luo, Chenxu; Yang, Xiaodong

In order to deal with the sparse and unstructured raw point clouds, LiDAR based 3D object detection research mostly focuses on designing dedicated local point aggregators for fine-grained geometrical modeling. In this paper, we revisit the local poin

External link: http://arxiv.org/abs/2305.04925

Report

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Authors: Zhang, Ziqiang; Zhou, Long; Wang, Chengyi; Chen, Sanyuan; Wu, Yu; Liu, Shujie; Chen, Zhuo; Liu, Yanqing; Wang, Huaming; Li, Jinyu; He, Lei; Zhao, Sheng; Wei, Furu

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target lang

External link: http://arxiv.org/abs/2303.03926

Report

Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Authors: Sun, Eric; Li, Jinyu; Hu, Yuxuan; Zhu, Yimeng; Zhou, Long; Xue, Jian; Wang, Peidong; Liu, Linquan; Liu, Shujie; Lin, Edward; Gong, Yifan

We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and LID loss

External link: http://arxiv.org/abs/2303.00786

Report

Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation

Authors: Wang, Xiaoqiang; Liu, Yanqing; Li, Jinyu; Zhao, Sheng

We previously proposed contextual spelling correction (CSC) to correct the output of end-to-end (E2E) automatic speech recognition (ASR) models with contextual information such as name, place, etc. Although CSC has achieved reasonable improvement in

External link: http://arxiv.org/abs/2302.11192

Report

Speaker Change Detection for Transformer Transducer ASR

Authors: Wu, Jian; Chen, Zhuo; Hu, Min; Xiao, Xiong; Li, Jinyu

Speaker change detection (SCD) is an important feature that improves the readability of the recognized words from an automatic speech recognition (ASR) system by breaking the word sequence into paragraphs at speaker change points. Existing SCD soluti

External link: http://arxiv.org/abs/2302.08549

Academic article

Study on the factors affecting the performance of oil-filled pressure sensitive core

Authors: Jin, Zhong; Li, Xiang; He, Feng; Liu, Fangting; Li, Jinyu; Li, Junhui

Published in: Sensor Review, 2024, Vol. 44, Issue 2, pp. 171-178.

External link: http://www.emeraldinsight.com/doi/10.1108/SR-01-2024-0042

Report

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Authors: Wang, Chengyi; Chen, Sanyuan; Wu, Yu; Zhang, Ziqiang; Zhou, Long; Liu, Shujie; Chen, Zhuo; Liu, Yanqing; Wang, Huaming; Li, Jinyu; He, Lei; Zhao, Sheng; Wei, Furu

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a condit

External link: http://arxiv.org/abs/2301.02111

Report

Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models

Authors: Zhao, Rui; Xue, Jian; Parthasarathy, Partha; Miljanic, Veljko; Li, Jinyu

Neural transducer is now the most popular end-to-end model for speech recognition, due to its naturally streaming ability. However, it is challenging to adapt it with text-only data. Factorized neural transducer (FNT) model was proposed to mitigate t

External link: http://arxiv.org/abs/2212.01992

Report

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Authors: Zhu, Qiushi; Zhou, Long; Zhang, Ziqiang; Liu, Shujie; Jiao, Binxing; Zhang, Jie; Dai, Lirong; Jiang, Daxin; Li, Jinyu; Wei, Furu

Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal in

External link: http://arxiv.org/abs/2211.11275
