Showing 1 - 10 of 121
for search: '"Krishna, Ranjay"'
Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractor…
External link:
http://arxiv.org/abs/2409.17958
Author:
Deitke, Matt, Clark, Christopher, Lee, Sangho, Tripathi, Rohun, Yang, Yue, Park, Jae Sung, Salehi, Mohammadreza, Muennighoff, Niklas, Lo, Kyle, Soldaini, Luca, Lu, Jiasen, Anderson, Taira, Bransom, Erin, Ehsani, Kiana, Ngo, Huong, Chen, YenSung, Patel, Ajay, Yatskar, Mark, Callison-Burch, Chris, Head, Andrew, Hendrix, Rose, Bastani, Favyen, VanderBilt, Eli, Lambert, Nathan, Chou, Yvonne, Chheda, Arnavi, Sparks, Jenna, Skjonsberg, Sam, Schmitz, Michael, Sarnat, Aaron, Bischoff, Byron, Walsh, Pete, Newell, Chris, Wolters, Piper, Gupta, Tanmay, Zeng, Kuo-Hao, Borchardt, Jon, Groeneveld, Dirk, Dumas, Jen, Nam, Crystal, Lebrecht, Sophie, Wittlif, Caitlin, Schoenick, Carissa, Michel, Oscar, Krishna, Ranjay, Weihs, Luca, Smith, Noah A., Hajishirzi, Hannaneh, Girshick, Ross, Farhadi, Ali, Kembhavi, Aniruddha
Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the…
External link:
http://arxiv.org/abs/2409.17146
Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that…
External link:
http://arxiv.org/abs/2408.02243
Author:
Liu, Benlin, Dong, Yuhao, Wang, Yiqin, Rao, Yongming, Tang, Yansong, Ma, Wei-Chiu, Krishna, Ranjay
Multimodal language models (MLLMs) are increasingly being implemented in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Despite their potential, current top models within our community…
External link:
http://arxiv.org/abs/2408.00754
Author:
Liu, Zuyan, Liu, Benlin, Wang, Jiahui, Dong, Yuhao, Chen, Guangyi, Rao, Yongming, Krishna, Ranjay, Lu, Jiwen
In the field of instruction-following large vision-language models (LVLMs), the efficient deployment of these models faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management strategies for…
External link:
http://arxiv.org/abs/2407.18121
Author:
Hsieh, Yu-Guan, Hsieh, Cheng-Yu, Yeh, Shih-Ying, Béthune, Louis, Pouransari, Hadi, Vasu, Pavan Kumar Anasosalu, Li, Chun-Liang, Krishna, Ranjay, Tuzel, Oncel, Cuturi, Marco
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not reflected…
External link:
http://arxiv.org/abs/2407.06723
When asked to summarize articles or answer questions given a passage, large language models (LLMs) can hallucinate details and respond with unsubstantiated answers that are inaccurate with respect to the input context. This paper describes a simple…
External link:
http://arxiv.org/abs/2407.07071
Author:
Duan, Jiafei, Yuan, Wentao, Pumacay, Wilbert, Wang, Yi Ru, Ehsani, Kiana, Fox, Dieter, Krishna, Ranjay
Large-scale endeavors and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot…
External link:
http://arxiv.org/abs/2406.18915
Author:
Hsieh, Cheng-Yu, Chuang, Yung-Sung, Li, Chun-Liang, Wang, Zifeng, Le, Long T., Kumar, Abhishek, Glass, James, Ratner, Alexander, Lee, Chen-Yu, Krishna, Ranjay, Pfister, Tomas
Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work…
External link:
http://arxiv.org/abs/2406.16008
Author:
Zhang, Jieyu, Huang, Weikai, Ma, Zixian, Michel, Oscar, He, Dong, Gupta, Tanmay, Ma, Wei-Chiu, Farhadi, Ali, Kembhavi, Aniruddha, Krishna, Ranjay
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their…
External link:
http://arxiv.org/abs/2406.11775