Showing 1 - 10 of 30 for search: '"Han, Kyu J"'
Author:
Peri, Raghuveer, Jayanthi, Sai Muralidhar, Ronanki, Srikanth, Bhatia, Anshu, Mundnich, Karel, Dingliwal, Saket, Das, Nilaksh, Hou, Zejiang, Huybrechts, Goeric, Vishnubhotla, Srikanth, Garcia-Romero, Daniel, Srinivasan, Sundararajan, Han, Kyu J, Kirchhoff, Katrin
Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we…
External link:
http://arxiv.org/abs/2405.08317
Author:
Das, Nilaksh, Dingliwal, Saket, Ronanki, Srikanth, Paturi, Rohit, Huang, Zhaocheng, Mathur, Prashant, Yuan, Jie, Bekal, Dhanush, Niu, Xing, Jayanthi, Sai Muralidhar, Li, Xilai, Mundnich, Karel, Sunkara, Monica, Srinivasan, Sundararajan, Han, Kyu J, Kirchhoff, Katrin
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text…
External link:
http://arxiv.org/abs/2405.08295
Author:
Goncalves, Lucas, Mathur, Prashant, Lavania, Chandrashekhar, Cekic, Metehan, Federico, Marcello, Han, Kyu J.
Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally accepted…
External link:
http://arxiv.org/abs/2404.07336
Author:
Kim, Kwangyoun, Wu, Felix, Peng, Yifan, Pan, Jing, Sridhar, Prashant, Han, Kyu J., Watanabe, Shinji
Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other…
External link:
http://arxiv.org/abs/2210.00077
Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each…
External link:
http://arxiv.org/abs/2112.07648
Author:
Shon, Suwon, Pasad, Ankita, Wu, Felix, Brusco, Pablo, Artzi, Yoav, Livescu, Karen, Han, Kyu J.
Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level…
External link:
http://arxiv.org/abs/2111.10367
Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed…
External link:
http://arxiv.org/abs/2106.09760
In this paper, we explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis. First, we investigate how useful a pre-trained language model would be in a 2-step pipeline approach…
External link:
http://arxiv.org/abs/2106.06598
Author:
Park, Tae Jin, Kanda, Naoyuki, Dimitriadis, Dimitrios, Han, Kyu J., Watanabe, Shinji, Narayanan, Shrikanth
Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech…
External link:
http://arxiv.org/abs/2101.09624
In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the…
External link:
http://arxiv.org/abs/2005.10469