Showing 1 - 10 of 249 for search: '"Feris Rogerio"'
Although open-vocabulary classification models like Contrastive Language-Image Pre-training (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive expe…
External link:
http://arxiv.org/abs/2412.02837
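For readers unfamiliar with the zero-shot setup the snippet refers to, here is a minimal sketch of open-vocabulary classification with CLIP via the Hugging Face transformers API. The checkpoint name is a standard public release; the paper's corruption-robustness experiments are not reproduced here.

```python
# Minimal zero-shot image classification with CLIP (Hugging Face transformers).
# Illustrates the open-vocabulary setup only; it does not reproduce the
# paper's corruption-robustness evaluation.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # any RGB image
labels = ["a photo of a dog", "a photo of a cat"]    # free-form class prompts

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)     # image-text similarity -> class probabilities
print(dict(zip(labels, probs[0].tolist())))
```

Because the classes are plain text prompts, the label set can be changed at inference time without retraining, which is what makes the model "open-vocabulary".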
Author:
Mitra, Chancharik, Huang, Brandon, Chai, Tianning, Lin, Zhiqiu, Arbelle, Assaf, Feris, Rogerio, Karlinsky, Leonid, Darrell, Trevor, Ramanan, Deva, Herzig, Roei
Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational di…
External link:
http://arxiv.org/abs/2412.00142
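As context for the VL tasks mentioned, below is a sketch of visual question answering with a generative LMM, using the public llava-hf checkpoint and its documented prompt template; Qwen-VL has its own API and is not shown. This illustrates the generative usage the snippet contrasts with discriminative tasks, not the paper's method.

```python
# Visual question answering with a generative LMM (LLaVA via transformers).
# Checkpoint name and prompt template follow the public llava-hf release.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg")
prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```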
Author:
Bhati, Saurabhchand, Gong, Yuan, Karlinsky, Leonid, Kuehne, Hilde, Feris, Rogerio, Glass, James
Large Audio Language Models (LALMs) combine audio perception models with Large Language Models (LLMs) and show a remarkable ability to reason about input audio, infer its meaning, and understand the intent. However, these systems rely on Tra…
External link:
http://arxiv.org/abs/2411.15685
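The snippet describes the general LALM recipe: an audio perception model coupled to an LLM. A common coupling, sketched below, projects audio encoder features into the LLM's embedding space and prepends them to the text tokens. All module names and dimensions here are illustrative assumptions, not taken from the paper.

```python
# Schematic LALM coupling: audio features are linearly projected into the
# LLM's token-embedding space and prepended to the text embeddings.
# Dimensions and module names are illustrative, not from the paper.
import torch
import torch.nn as nn

class AudioToLLMAdapter(nn.Module):
    def __init__(self, audio_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)  # maps audio frames to "soft tokens"

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        audio_tokens = self.proj(audio_feats)                 # (B, T_audio, llm_dim)
        return torch.cat([audio_tokens, text_embeds], dim=1)  # LLM consumes both

adapter = AudioToLLMAdapter()
audio_feats = torch.randn(1, 100, 512)   # output of a frozen audio encoder
text_embeds = torch.randn(1, 16, 4096)   # embedded text prompt
llm_input = adapter(audio_feats, text_embeds)
print(llm_input.shape)  # torch.Size([1, 116, 4096])
```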
Author:
Doveh, Sivan, Shabtay, Nimrod, Lin, Wei, Schwartz, Eli, Kuehne, Hilde, Giryes, Raja, Feris, Rogerio, Karlinsky, Leonid, Glass, James, Arbelle, Assaf, Ullman, Shimon, Mirza, M. Jehanzeb
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we fi…
External link:
http://arxiv.org/abs/2411.13317
Author:
Mirza, M. Jehanzeb, Zhao, Mengjie, Mao, Zhuoyuan, Doveh, Sivan, Lin, Wei, Gavrikov, Paul, Dorkenwald, Michael, Yang, Shiqi, Jha, Saurav, Wakaki, Hiromi, Mitsufuji, Yuki, Possegger, Horst, Feris, Rogerio, Karlinsky, Leonid, Glass, James
In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task descriptio…
External link:
http://arxiv.org/abs/2410.06154
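The snippet describes GLOV's core idea: meta-prompting an LLM so it acts as an implicit optimizer of VLM prompts. A skeleton of such a loop is sketched below; `llm_propose` and `vlm_score` are hypothetical stand-ins for the LLM call and the downstream evaluation, not the paper's actual interfaces.

```python
# Skeleton of an LLM-as-optimizer loop in the spirit of GLOV: the LLM proposes
# candidate VLM prompts, each candidate is scored on the downstream task, and
# the best scorers are fed back into the meta-prompt. `llm_propose` and
# `vlm_score` are hypothetical stand-ins, not the paper's interfaces.
from typing import Callable

def optimize_prompt(
    task_description: str,
    llm_propose: Callable[[str], list[str]],   # meta-prompted LLM -> candidate prompts
    vlm_score: Callable[[str], float],         # downstream accuracy of a prompt
    steps: int = 5,
    keep: int = 3,
) -> str:
    history: list[tuple[float, str]] = []
    meta_prompt = f"Task: {task_description}\nPropose prompts for a vision-language model."
    for _ in range(steps):
        for candidate in llm_propose(meta_prompt):
            history.append((vlm_score(candidate), candidate))
        history.sort(reverse=True)             # best-scoring prompts first
        feedback = "\n".join(f"{s:.3f}: {p}" for s, p in history[:keep])
        # Feed the best prompts (with scores) back so the LLM can improve on them.
        meta_prompt = (f"Task: {task_description}\n"
                       f"Previous prompts and their scores:\n{feedback}\n"
                       f"Propose better prompts.")
    return history[0][1]
```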
Author:
Stallone, Matt, Saxena, Vaibhav, Karlinsky, Leonid, McGinn, Bridget, Bula, Tim, Mishra, Mayank, Soria, Adriana Meza, Zhang, Gaoyuan, Prasad, Aditya, Shen, Yikang, Surendran, Saptha, Guttula, Shanmukha, Patel, Hima, Selvam, Parameswaran, Dang, Xuan-Hong, Koyfman, Yan, Sood, Atin, Feris, Rogerio, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of Granite 3B/8B code models from 2K/4K to 128K consists of a lightweight continual pretraini…
External link:
http://arxiv.org/abs/2407.13739
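A common ingredient in this style of context-length scaling is raising the base frequency of the rotary position embeddings (RoPE) so that positional rotations remain informative at much longer ranges, with continual pretraining adapting the model afterwards. The sketch below shows only that frequency adjustment, as a general illustration rather than the paper's exact recipe.

```python
# Rotary-embedding frequencies with an adjustable base. Raising the base
# (e.g., 10_000 -> 10_000_000) stretches rotation wavelengths so longer
# contexts stay distinguishable. A general long-context ingredient,
# sketched from common practice, not the paper's exact recipe.
import torch

def rope_frequencies(head_dim: int, base: float) -> torch.Tensor:
    # One inverse frequency per pair of hidden dimensions.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

short_ctx = rope_frequencies(head_dim=128, base=10_000.0)
long_ctx = rope_frequencies(head_dim=128, base=10_000_000.0)
# Lower frequencies -> slower rotation -> far-apart positions stay separable.
print(short_ctx[-1].item(), long_ctx[-1].item())
```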
Author:
Bhati, Saurabhchand, Gong, Yuan, Karlinsky, Leonid, Kuehne, Hilde, Feris, Rogerio, Glass, James
State-space models (SSMs) have emerged as an alternative to Transformers for audio modeling due to their high computational efficiency with long inputs. While recent efforts on Audio SSMs have reported encouraging results, two main limitations remain…
External link:
http://arxiv.org/abs/2407.04082
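The efficiency advantage the snippet mentions comes from SSMs processing a sequence with a linear recurrence, costing O(length) time and constant memory per step, versus the quadratic attention cost of Transformers. A minimal diagonal state-space recurrence is sketched below; parameters and shapes are illustrative, not the paper's model.

```python
# Linear state-space recurrence: x_k = A*x_{k-1} + B*u_k, y_k = C*x_k.
# Cost is O(length) with constant per-step memory, which is why SSMs
# scale to long audio. Diagonal A and random parameters are illustrative.
import torch

def ssm_scan(u: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    # u: (T,) input signal; A, B, C: (N,) diagonal state-space parameters.
    x = torch.zeros_like(A)
    ys = []
    for u_k in u:
        x = A * x + B * u_k        # state update, elementwise for diagonal A
        ys.append((C * x).sum())   # readout
    return torch.stack(ys)

N = 16
A = torch.rand(N) * 0.9            # |A| < 1 keeps the recurrence stable
B, C = torch.randn(N), torch.randn(N)
y = ssm_scan(torch.randn(1000), A, B, C)
print(y.shape)  # torch.Size([1000])
```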
Recently, Large Language Models (LLMs) have attained impressive performance on math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce…
External link:
http://arxiv.org/abs/2406.12172
Author:
Kang, Junmo, Karlinsky, Leonid, Luo, Hongyin, Wang, Zhen, Hansen, Jacob, Glass, James, Cox, David, Panda, Rameswar, Feris, Rogerio, Ritter, Alan
We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert…
External link:
http://arxiv.org/abs/2406.12034
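For the mixture-of-experts mechanism underlying a system like MiXSE, a minimal token-level router over expert modules is sketched below. The experts and router here are generic assumptions; the self-specialization procedure that the paper uses to construct the experts is not shown.

```python
# Minimal mixture-of-experts routing: a learned router sends each token to
# its highest-scoring expert module. Sketches the generic MoE mechanism
# only; shapes and modules are illustrative, not the paper's design.
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (tokens, dim). Route each token to its top-scoring expert.
        scores = self.router(h).softmax(dim=-1)
        choice = scores.argmax(dim=-1)
        out = torch.zeros_like(h)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(h[mask])
        return out

moe = TopOneMoE()
print(moe(torch.randn(10, 256)).shape)  # torch.Size([10, 256])
```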
Author:
Rouditchenko, Andrew, Gong, Yuan, Thomas, Samuel, Karlinsky, Leonid, Kuehne, Hilde, Feris, Rogerio, Glass, James
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models s…
External link:
http://arxiv.org/abs/2406.10082
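To make the AVSR setup concrete, the sketch below fuses frame-aligned audio and lip-video features by concatenation before a shared encoder. This is a generic fusion scheme with illustrative dimensions, not the paper's architecture.

```python
# Schematic audio-visual fusion for AVSR: frame-aligned audio and lip-video
# features are concatenated and passed to a shared encoder. Dimensions and
# modules are illustrative, not the paper's architecture.
import torch
import torch.nn as nn

audio_dim, video_dim, model_dim = 80, 512, 256
fuse = nn.Linear(audio_dim + video_dim, model_dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True),
    num_layers=2,
)

audio = torch.randn(1, 100, audio_dim)   # e.g., log-mel frames
video = torch.randn(1, 100, video_dim)   # lip-region features, frame-aligned
fused = fuse(torch.cat([audio, video], dim=-1))
print(encoder(fused).shape)              # torch.Size([1, 100, 256])
```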