Showing 1 - 10 of 51 for search: '"Narayan, Sanath"'
Author:
Maniparambil, Mayug, Akshulakov, Raiymbek, Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Singh, Ankit, O'Connor, Noel E.
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications due to their aligned latent space. However, this practice…
External link:
http://arxiv.org/abs/2409.19425
Author:
Malartic, Quentin, Chowdhury, Nilabhra Roy, Cojocaru, Ruxandra, Farooq, Mugariya, Campesan, Giulia, Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Singh, Ankit, Velikanov, Maksim, Boussaha, Basma El Amel, Al-Yafeai, Mohammed, Alobeidli, Hamza, Qadi, Leen Al, Seddik, Mohamed El Amine, Fedyanin, Kirill, Alami, Reda, Hacid, Hakim
We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings during the training of Falcon2-11B, which follows a multi-stage…
External link:
http://arxiv.org/abs/2407.14885
Author:
Gupta, Akshita, Arora, Aditya, Narayan, Sanath, Khan, Salman, Khan, Fahad Shahbaz, Taylor, Graham W.
Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as…
External link:
http://arxiv.org/abs/2406.15556
Author:
Kumar, Amandeep, Awais, Muhammad, Narayan, Sanath, Cholakkal, Hisham, Khan, Salman, Anwer, Rao Muhammad
Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require…
External link:
http://arxiv.org/abs/2406.04413
Author:
Kumar, Amandeep, Naseer, Muzammal, Narayan, Sanath, Anwer, Rao Muhammad, Khan, Salman, Cholakkal, Hisham
In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from lengthy prompts…
External link:
http://arxiv.org/abs/2405.18304
Author:
Narayan, Sanath, Djilali, Yasser Abdelaziz Dahou, Singh, Ankit, Bihan, Eustache Le, Hacid, Hakim
This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except for English,…
External link:
http://arxiv.org/abs/2406.00038
Author:
Maniparambil, Mayug, Akshulakov, Raiymbek, Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Seddik, Mohamed El Amine, Mangalam, Karttikeya, O'Connor, Noel E.
Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an alignment…
External link:
http://arxiv.org/abs/2401.05224
Author:
Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Bihan, Eustache Le, Boussaid, Haithem, Almazrouei, Ebtessam, Debbah, Merouane
The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its excessively used test set, which…
External link:
http://arxiv.org/abs/2311.14063
Author:
Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Boussaid, Haithem, Almazrouei, Ebtessam, Debbah, Merouane
Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or fine-tune…
External link:
http://arxiv.org/abs/2308.06112
Author:
Noman, Mubashir, Fiaz, Mustansar, Cholakkal, Hisham, Narayan, Sanath, Anwer, Rao Muhammad, Khan, Salman, Khan, Fahad Shahbaz
Current transformer-based change detection (CD) approaches either employ a model pre-trained on the large-scale ImageNet image classification dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark…
External link:
http://arxiv.org/abs/2304.06710