Výsledky vyhledávání

Report

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Autor: Suri, Manan, Mathur, Puneet, Dernoncourt, Franck, Goswami, Kanika, Rossi, Ryan A., Manocha, Dinesh

Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to

Externí odkaz: http://arxiv.org/abs/2412.10704

Zobrazit plný text záznamu

Report

SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

Autor: Kumar, Sonal, Seetharaman, Prem, Salamon, Justin, Manocha, Dinesh, Nieto, Oriol

The field of text-to-audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generate

Externí odkaz: http://arxiv.org/abs/2412.09789

Zobrazit plný text záznamu

Report

PromptRefine: Enhancing Few-Shot Performance on Low-Resource Indic Languages with Example Selection from Related Example Banks

Autor: Ghosal, Soumya Suvra, Pal, Soumyabrata, Mukherjee, Koyel, Manocha, Dinesh

Large Language Models (LLMs) have recently demonstrated impressive few-shot learning capabilities through in-context learning (ICL). However, ICL performance is highly dependent on the choice of few-shot demonstrations, making the selection of the mo

Externí odkaz: http://arxiv.org/abs/2412.05710

Zobrazit plný text záznamu

Report

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Autor: Ghosal, Soumya Suvra, Chakraborty, Souradip, Singh, Vaibhav, Guan, Tianrui, Wang, Mengdi, Beirami, Ahmad, Huang, Furong, Velasquez, Alvaro, Manocha, Dinesh, Bedi, Amrit Singh

With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to

Externí odkaz: http://arxiv.org/abs/2411.18688

Zobrazit plný text záznamu

Report

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Autor: Sakshi, S, Tyagi, Utkarsh, Kumar, Sonal, Seth, Ashish, Selvakumar, Ramaneswaran, Nieto, Oriol, Duraiswami, Ramani, Ghosh, Sreyan, Manocha, Dinesh

The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on ta

Externí odkaz: http://arxiv.org/abs/2410.19168

Zobrazit plný text záznamu

Report

Do Audio-Language Models Understand Linguistic Variations?

Autor: Selvakumar, Ramaneswaran, Kumar, Sonal, Giri, Hemant Kumar, Anand, Nishit, Seth, Ashish, Ghosh, Sreyan, Manocha, Dinesh

Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled exp

Externí odkaz: http://arxiv.org/abs/2410.16505

Zobrazit plný text záznamu

Report

DocEdit-v2: Document Structure Editing Via Multimodal LLM Grounding

Autor: Suri, Manan, Mathur, Puneet, Dernoncourt, Franck, Jain, Rajiv, Morariu, Vlad I, Sawhney, Ramit, Nakov, Preslav, Manocha, Dinesh

Document structure editing involves manipulating localized textual, visual, and layout components in document images based on the user's requests. Past works have shown that multimodal grounding of user requests in the document image and identifying

Externí odkaz: http://arxiv.org/abs/2410.16472

Zobrazit plný text záznamu

Report

PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification

Autor: Seth, Ashish, Selvakumar, Ramaneswaran, Kumar, Sonal, Ghosh, Sreyan, Manocha, Dinesh

Audio-Language Models (ALMs) have demonstrated remarkable performance in zero-shot audio classification. In this paper, we introduce PAT (Parameter-free Audio-Text aligner), a simple and training-free method aimed at boosting the zero-shot audio clas

Externí odkaz: http://arxiv.org/abs/2410.15062

Zobrazit plný text záznamu

Report

Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation

Autor: Ghosh, Sreyan, Rasooli, Mohammad Sadegh, Levit, Michael, Wang, Peidong, Xue, Jian, Manocha, Dinesh, Li, Jinyu

Generative Error Correction (GEC) has emerged as a powerful post-processing method to enhance the performance of Automatic Speech Recognition (ASR) systems. However, we show that GEC models struggle to generalize beyond the specific types of errors e

Externí odkaz: http://arxiv.org/abs/2410.13198

Zobrazit plný text záznamu

Report

EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning

Autor: Seth, Ashish, Selvakumar, Ramaneswaran, Sakshi, S, Kumar, Sonal, Ghosh, Sreyan, Manocha, Dinesh

In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech representation learning. In contrast to the prior methods that use random masking schemes for Masked Acoustic Mode

Externí odkaz: http://arxiv.org/abs/2410.13179

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání