Showing 1 - 10 of 51 for search: '"Narayan, Sanath"'
Author:
Maniparambil, Mayug, Akshulakov, Raiymbek, Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Singh, Ankit, O'Connor, Noel E.
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications due to their aligned latent space. However, this practice…
External link:
http://arxiv.org/abs/2409.19425
Author:
Malartic, Quentin, Chowdhury, Nilabhra Roy, Cojocaru, Ruxandra, Farooq, Mugariya, Campesan, Giulia, Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Singh, Ankit, Velikanov, Maksim, Boussaha, Basma El Amel, Al-Yafeai, Mohammed, Alobeidli, Hamza, Qadi, Leen Al, Seddik, Mohamed El Amine, Fedyanin, Kirill, Alami, Reda, Hacid, Hakim
We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and its multimodal counterpart, Falcon2-11B-vlm, which is a vision-to-text model. We report our findings during the training of Falcon2-11B, which follows a multi-stage…
External link:
http://arxiv.org/abs/2407.14885
Author:
Gupta, Akshita, Arora, Aditya, Narayan, Sanath, Khan, Salman, Khan, Fahad Shahbaz, Taylor, Graham W.
Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as…
External link:
http://arxiv.org/abs/2406.15556
Author:
Kumar, Amandeep, Awais, Muhammad, Narayan, Sanath, Cholakkal, Hisham, Khan, Salman, Anwer, Rao Muhammad
Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require…
External link:
http://arxiv.org/abs/2406.04413
Author:
Kumar, Amandeep, Naseer, Muzammal, Narayan, Sanath, Anwer, Rao Muhammad, Khan, Salman, Cholakkal, Hisham
In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from lengthy prompts…
External link:
http://arxiv.org/abs/2405.18304
Author:
Narayan, Sanath, Djilali, Yasser Abdelaziz Dahou, Singh, Ankit, Bihan, Eustache Le, Hacid, Hakim
This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except for English,…
External link:
http://arxiv.org/abs/2406.00038
Author:
Maniparambil, Mayug, Akshulakov, Raiymbek, Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Seddik, Mohamed El Amine, Mangalam, Karttikeya, O'Connor, Noel E.
Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an alignment…
External link:
http://arxiv.org/abs/2401.05224
Author:
Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Bihan, Eustache Le, Boussaid, Haithem, Almazrouei, Ebtessam, Debbah, Merouane
The Lip Reading Sentences-3 (LRS3) benchmark has primarily been the focus of intense research in visual speech recognition (VSR) during the last few years. As a result, there is an increased risk of overfitting to its excessively used test set, which…
External link:
http://arxiv.org/abs/2311.14063
Author:
Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Boussaid, Haithem, Almazrouei, Ebtessam, Debbah, Merouane
Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or fine-tune…
External link:
http://arxiv.org/abs/2308.06112
Author:
Noman, Mubashir, Fiaz, Mustansar, Cholakkal, Hisham, Narayan, Sanath, Anwer, Rao Muhammad, Khan, Salman, Khan, Fahad Shahbaz
Current transformer-based change detection (CD) approaches either employ a model pre-trained on the large-scale ImageNet image classification dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark…
External link:
http://arxiv.org/abs/2304.06710