Showing 1 - 10 of 1,666 results for the search: '"Manocha P"'
Author:
Suri, Manan, Mathur, Puneet, Dernoncourt, Franck, Goswami, Kanika, Rossi, Ryan A., Manocha, Dinesh
Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to…
External link:
http://arxiv.org/abs/2412.10704
The field of text-to-audio generation has seen significant advancements, yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generate…
External link:
http://arxiv.org/abs/2412.09789
Large Language Models (LLMs) have recently demonstrated impressive few-shot learning capabilities through in-context learning (ICL). However, ICL performance is highly dependent on the choice of few-shot demonstrations, making the selection of the most…
External link:
http://arxiv.org/abs/2412.05710
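Where this abstract describes selecting few-shot demonstrations for in-context learning, a minimal sketch of the common similarity-based selection baseline may help; it illustrates the general setting, not this paper's method. `embed` is a hypothetical sentence-embedding function standing in for any off-the-shelf text encoder.

```python
# Sketch: pick the k pool examples most similar to the query (cosine similarity)
# and use them as few-shot demonstrations in the LLM prompt.
import numpy as np

def select_demonstrations(query, pool, embed, k=4):
    q = embed(query)                           # (d,) query embedding
    P = np.stack([embed(x) for x in pool])     # (n, d) pool embeddings
    sims = P @ q / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-9)
    return [pool[i] for i in np.argsort(-sims)[:k]]  # highest scores first
```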
Author:
Ghosal, Soumya Suvra, Chakraborty, Souradip, Singh, Vaibhav, Guan, Tianrui, Wang, Mengdi, Beirami, Ahmad, Huang, Furong, Velasquez, Alvaro, Manocha, Dinesh, Bedi, Amrit Singh
With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to…
External link:
http://arxiv.org/abs/2411.18688
Author:
Sakshi, S, Tyagi, Utkarsh, Kumar, Sonal, Seth, Ashish, Selvakumar, Ramaneswaran, Nieto, Oriol, Duraiswami, Ramani, Ghosh, Sreyan, Manocha, Dinesh
The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks…
External link:
http://arxiv.org/abs/2410.19168
Author:
Selvakumar, Ramaneswaran, Kumar, Sonal, Giri, Hemant Kumar, Anand, Nishit, Seth, Ashish, Ghosh, Sreyan, Manocha, Dinesh
Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments…
External link:
http://arxiv.org/abs/2410.16505
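The retrieval paradigm this abstract refers to can be sketched generically: a CLAP-style dual encoder maps audio and text into a shared embedding space, and retrieval ranks clips by cosine similarity to the query. The text encoder below is a hypothetical stand-in, not the paper's setup.

```python
# Sketch: rank precomputed audio embeddings against a natural-language query.
import numpy as np

def retrieve(query, audio_embs, embed_text):
    q = embed_text(query)
    q = q / np.linalg.norm(q)                                  # unit-normalize query
    A = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    return np.argsort(-(A @ q))                                # best match first
```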
Author:
Suri, Manan, Mathur, Puneet, Dernoncourt, Franck, Jain, Rajiv, Morariu, Vlad I, Sawhney, Ramit, Nakov, Preslav, Manocha, Dinesh
Document structure editing involves manipulating localized textual, visual, and layout components in document images based on the user's requests. Past works have shown that multimodal grounding of user requests in the document image and identifying…
External link:
http://arxiv.org/abs/2410.16472
Audio-Language Models (ALMs) have demonstrated remarkable performance in zero-shot audio classification. In this paper, we introduce PAT (Parameter-free Audio-Text aligner), a simple and training-free method aimed at boosting the zero-shot audio classification…
External link:
http://arxiv.org/abs/2410.15062
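Zero-shot audio classification with an ALM, the setting PAT aims to improve, is commonly done as sketched below: score the clip against text prompts built from class names and pick the best match. The prompt template and encoders are illustrative assumptions, not PAT itself.

```python
# Sketch: zero-shot classification via audio-text embedding similarity.
import numpy as np

def zero_shot_classify(clip, class_names, embed_audio, embed_text):
    a = embed_audio(clip)
    a = a / np.linalg.norm(a)                                  # unit-normalize audio
    T = np.stack([embed_text(f"the sound of a {c}") for c in class_names])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    return class_names[int(np.argmax(T @ a))]                  # highest-scoring class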
Author:
Ghosh, Sreyan, Rasooli, Mohammad Sadegh, Levit, Michael, Wang, Peidong, Xue, Jian, Manocha, Dinesh, Li, Jinyu
Generative Error Correction (GEC) has emerged as a powerful post-processing method to enhance the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors…
External link:
http://arxiv.org/abs/2410.13198
Author:
Seth, Ashish, Selvakumar, Ramaneswaran, Sakshi, S, Kumar, Sonal, Ghosh, Sreyan, Manocha, Dinesh
In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to prior methods that use random masking schemes for Masked Acoustic Modeling…
External link:
http://arxiv.org/abs/2410.13179
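For contrast with EH-MAM's adaptive scheme, here is a sketch of the random masking baseline the abstract mentions: mask a fixed fraction of speech frames uniformly at random before reconstruction. The 15% ratio and zero fill are illustrative assumptions, not the paper's settings.

```python
# Sketch: uniform random frame masking for Masked Acoustic Modeling.
import numpy as np

def random_mask(features, mask_ratio=0.15, rng=None):
    rng = rng or np.random.default_rng()
    T = features.shape[0]                       # features: (T, d) frame sequence
    mask = np.zeros(T, dtype=bool)
    mask[rng.choice(T, size=int(T * mask_ratio), replace=False)] = True
    out = features.copy()
    out[mask] = 0.0                             # zero out masked frames
    return out, mask
```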