Showing 1 - 10
of 3,571
for search: '"Karlinsky A"'
Author:
Shabtay, Nimrod, Polo, Felipe Maia, Doveh, Sivan, Lin, Wei, Mirza, M. Jehanzeb, Chosen, Leshem, Yurochkin, Mikhail, Sun, Yuekai, Arbelle, Assaf, Karlinsky, Leonid, Giryes, Raja
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping…
External link:
http://arxiv.org/abs/2410.10783
Author:
Mirza, M. Jehanzeb, Zhao, Mengjie, Mao, Zhuoyuan, Doveh, Sivan, Lin, Wei, Gavrikov, Paul, Dorkenwald, Michael, Yang, Shiqi, Jha, Saurav, Wakaki, Hiromi, Mitsufuji, Yuki, Possegger, Horst, Feris, Rogerio, Karlinsky, Leonid, Glass, James
In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description…
External link:
http://arxiv.org/abs/2410.06154
Author:
Stallone, Matt, Saxena, Vaibhav, Karlinsky, Leonid, McGinn, Bridget, Bula, Tim, Mishra, Mayank, Soria, Adriana Meza, Zhang, Gaoyuan, Prasad, Aditya, Shen, Yikang, Surendran, Saptha, Guttula, Shanmukha, Patel, Hima, Selvam, Parameswaran, Dang, Xuan-Hong, Koyfman, Yan, Sood, Atin, Feris, Rogerio, Desai, Nirmit, Cox, David D., Puri, Ruchir, Panda, Rameswar
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of Granite 3B/8B code models from 2K/4K to 128K consists of a lightweight continual pretraining…
External link:
http://arxiv.org/abs/2407.13739
Author:
Bhati, Saurabhchand, Gong, Yuan, Karlinsky, Leonid, Kuehne, Hilde, Feris, Rogerio, Glass, James
State-space models (SSMs) have emerged as an alternative to Transformers for audio modeling due to their high computational efficiency with long inputs. While recent efforts on Audio SSMs have reported encouraging results, two main limitations remain…
External link:
http://arxiv.org/abs/2407.04082
Author:
Huang, Brandon, Mitra, Chancharik, Arbelle, Assaf, Karlinsky, Leonid, Darrell, Trevor, Herzig, Roei
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial…
External link:
http://arxiv.org/abs/2406.15334
Recently, Large Language Models (LLMs) have attained impressive performance on math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce…
External link:
http://arxiv.org/abs/2406.12172
Author:
Kang, Junmo, Karlinsky, Leonid, Luo, Hongyin, Wang, Zhen, Hansen, Jacob, Glass, James, Cox, David, Panda, Rameswar, Feris, Rogerio, Ritter, Alan
We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert…
External link:
http://arxiv.org/abs/2406.12034
Author:
Rouditchenko, Andrew, Gong, Yuan, Thomas, Samuel, Karlinsky, Leonid, Kuehne, Hilde, Feris, Rogerio, Glass, James
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models…
External link:
http://arxiv.org/abs/2406.10082
Author:
Lin, Wei, Mirza, Muhammad Jehanzeb, Doveh, Sivan, Feris, Rogerio, Giryes, Raja, Hochreiter, Sepp, Karlinsky, Leonid
Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions…
External link:
http://arxiv.org/abs/2406.09240
Author:
Huang, Irene, Lin, Wei, Mirza, M. Jehanzeb, Hansen, Jacob A., Doveh, Sivan, Butoi, Victor Ion, Herzig, Roei, Arbelle, Assaf, Kuehne, Hilde, Darrell, Trevor, Gan, Chuang, Oliva, Aude, Feris, Rogerio, Karlinsky, Leonid
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency…
External link:
http://arxiv.org/abs/2406.08164