Showing 1 - 10 of 72 for search: '"Seth Ashish"'
Author:
Sakshi, S, Tyagi, Utkarsh, Kumar, Sonal, Seth, Ashish, Selvakumar, Ramaneswaran, Nieto, Oriol, Duraiswami, Ramani, Ghosh, Sreyan, Manocha, Dinesh
The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks …
External link:
http://arxiv.org/abs/2410.19168
Author:
Selvakumar, Ramaneswaran, Kumar, Sonal, Giri, Hemant Kumar, Anand, Nishit, Seth, Ashish, Ghosh, Sreyan, Manocha, Dinesh
Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments …
External link:
http://arxiv.org/abs/2410.16505
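The CLAP-style retrieval paradigm this abstract refers to scores natural-language queries against audio clips in a shared embedding space. A minimal sketch of that scoring step (the encoder stubs below are hypothetical placeholders, not the paper's models):

```python
# Sketch of CLAP-style audio-text retrieval: rank audio clips by cosine
# similarity to a natural-language query in a shared embedding space.
# `encode_audio` / `encode_text` are hypothetical stand-ins for trained encoders.
import torch
import torch.nn.functional as F

def encode_audio(clips: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real model would map waveforms to embeddings.
    return F.normalize(torch.randn(clips.shape[0], 512), dim=-1)

def encode_text(queries: list[str]) -> torch.Tensor:
    return F.normalize(torch.randn(len(queries), 512), dim=-1)

clips = torch.randn(100, 48000)           # 100 one-second clips (fake data)
query_emb = encode_text(["a dog barking while rain falls"])
audio_emb = encode_audio(clips)

# Embeddings are L2-normalized, so a dot product is cosine similarity.
scores = audio_emb @ query_emb.T          # (100, 1) similarity scores
top5 = scores.squeeze(1).topk(5).indices  # indices of best-matching clips
print(top5)
```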
Audio-Language Models (ALMs) have demonstrated remarkable performance in zero-shot audio classification. In this paper, we introduce PAT (Parameter-free Audio-Text aligner), a simple and training-free method aimed at boosting the zero-shot audio classification …
External link:
http://arxiv.org/abs/2410.15062
Author:
Seth, Ashish, Selvakumar, Ramaneswaran, Sakshi, S, Kumar, Sonal, Ghosh, Sreyan, Manocha, Dinesh
In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to prior methods that use random masking schemes for Masked Acoustic Modeling …
External link:
http://arxiv.org/abs/2410.13179
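As a rough illustration of the easy-to-hard idea contrasted with random masking above, one can rank frames by a difficulty score and slide the masked set from easy frames toward hard ones as training progresses. The scoring function and linear schedule below are assumptions for illustration, not the exact EH-MAM procedure:

```python
# Hedged sketch of an easy-to-hard masking schedule for masked acoustic
# modeling. The difficulty score and linear schedule are illustrative
# assumptions, not the paper's exact method.
import torch

def select_mask(difficulty: torch.Tensor, mask_ratio: float,
                progress: float) -> torch.Tensor:
    """difficulty: (T,) per-frame scores (e.g., a teacher's reconstruction
    loss); progress: training progress in [0, 1]. Early in training the
    easiest frames are masked; later the window shifts toward the hardest."""
    T = difficulty.shape[0]
    n_mask = int(mask_ratio * T)
    order = difficulty.argsort()             # easy -> hard
    start = int(progress * (T - n_mask))     # slide window toward hard frames
    mask = torch.zeros(T, dtype=torch.bool)
    mask[order[start:start + n_mask]] = True
    return mask

difficulty = torch.rand(200)                   # stand-in difficulty scores
mask_early = select_mask(difficulty, 0.3, 0.0) # masks the easiest 30%
mask_late = select_mask(difficulty, 0.3, 1.0)  # masks the hardest 30%
```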
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Author:
Ghosh, Sreyan, Kumar, Sonal, Seth, Ashish, Evuru, Chandra Kiran Reddy, Tyagi, Utkarsh, Sakshi, S, Nieto, Oriol, Duraiswami, Ramani, Manocha, Dinesh
Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with advanced audio understanding and complex reasoning abilities …
External link:
http://arxiv.org/abs/2406.11768
Author:
Ghosh, Sreyan, Kumar, Sonal, Seth, Ashish, Chiniya, Purva, Tyagi, Utkarsh, Duraiswami, Ramani, Manocha, Dinesh
Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues …
External link:
http://arxiv.org/abs/2406.04432
Continued pre-training (CP) offers multiple advantages, like target-domain adaptation and the potential to exploit the continuous stream of unlabeled data available online. However, continued pre-training on out-of-domain distributions often leads to …
External link:
http://arxiv.org/abs/2312.13026
Continued self-supervised (SSL) pre-training for adapting existing SSL models to the target domain has been shown to be extremely effective for low-resource Automatic Speech Recognition (ASR). This paper proposes Stable Distillation, a simple and novel approach …
External link:
http://arxiv.org/abs/2312.12783
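A hedged sketch of the general pattern the two abstracts above describe: continued SSL pre-training on a target domain, regularized by distilling from a frozen copy of the initial model so the adapted model does not drift from its out-of-domain knowledge. The MSE distillation term and the weight `lam` are illustrative assumptions, not the paper's exact objective:

```python
# Sketch: continued pre-training regularized by distillation from a frozen
# snapshot of the initial model. Loss terms and weighting are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
teacher = copy.deepcopy(student)          # frozen copy of the initial weights
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
lam = 0.5                                 # distillation weight (assumption)

for step in range(100):
    x = torch.randn(16, 80)               # stand-in target-domain features
    s = student(x)
    with torch.no_grad():
        t = teacher(x)
    ssl_loss = s.pow(2).mean()            # placeholder for the real SSL loss
    distill_loss = F.mse_loss(s, t)       # stay close to the initial model
    loss = ssl_loss + lam * distill_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```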
Author:
Ghosh, Sreyan, Seth, Ashish, Kumar, Sonal, Tyagi, Utkarsh, Evuru, Chandra Kiran, Ramaneswaran, S., Sakshi, S., Nieto, Oriol, Duraiswami, Ramani, Manocha, Dinesh
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many …
External link:
http://arxiv.org/abs/2310.08753
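The contrastive (CLAP-style) objective named in this abstract aligns paired audio and text embeddings with a symmetric cross-entropy over a similarity matrix; mismatched pairs within the batch serve as negatives. A minimal self-contained sketch:

```python
# Sketch of a CLAP-style symmetric contrastive loss: matched audio/text
# pairs are pulled together, all other in-batch pairs pushed apart.
import torch
import torch.nn.functional as F

def clap_style_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.shape[0])      # i-th audio matches i-th caption
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# Fake batch of 32 paired embeddings; real encoders would produce these.
loss = clap_style_loss(torch.randn(32, 512), torch.randn(32, 512))
```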