Showing 1 - 10 of 116 results for search: "Cho, Jaemin"
The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Recent approaches using LLMs as annotators reduce human effort, but…
External link: http://arxiv.org/abs/2410.06215
Authors: Onoe, Yasumasa, Rane, Sunayana, Berger, Zachary, Bitton, Yonatan, Cho, Jaemin, Garg, Roopal, Ku, Alexander, Parekh, Zarana, Pont-Tuset, Jordi, Tanzer, Garrett, Wang, Su, Baldridge, Jason
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, …
External link: http://arxiv.org/abs/2404.19753
ControlNets are widely used for adding spatial control to text-to-image diffusion models with different conditions, such as depth maps, scribbles/sketches, and human poses. However, when it comes to controllable video generation, ControlNets cannot be… (a sketch of the standard ControlNet setup follows the link below)
External link: http://arxiv.org/abs/2404.09967
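For context, here is a minimal sketch of the standard single-image ControlNet setup this abstract refers to, using the Hugging Face diffusers library. The checkpoint IDs are common public ones and the depth map is a placeholder; this illustrates the baseline, not the paper's video method.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Attach a depth-conditioned ControlNet to a Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# A precomputed depth map acts as the spatial condition (placeholder path).
depth_map = load_image("depth_map.png")
image = pipe(
    "a cozy reading nook with a large window",
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("controlled_output.png")
```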
The goal of interactive image segmentation is to delineate specific regions within an image via visual or language prompts. Low-latency, high-quality interactive segmentation with diverse prompts remains challenging for existing specialist and generalist…
External link: http://arxiv.org/abs/2404.00741
Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance… (a generic agent-loop sketch follows the link below)
External link: http://arxiv.org/abs/2403.12014
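As a rough illustration of the LLM-as-agent pattern this abstract describes, here is a hedged sketch of the interaction loop. Both `llm` (any prompt-to-text callable) and `env` (a minimal reset()/step() interface) are hypothetical stand-ins, not a specific framework's API.

```python
def run_llm_agent(llm, env, max_steps=50):
    """Generic loop: the LLM reads the observation and picks the next action.

    Assumes env.step(action) returns (observation, reward, done); this
    interface is an assumption for illustration, not a real library.
    """
    obs = env.reset()
    history = []
    for _ in range(max_steps):
        prompt = (
            "You are an embodied agent.\n"
            f"Past actions: {history}\n"
            f"Current observation: {obs}\n"
            "Reply with the single next action."
        )
        action = llm(prompt)          # LLM decides the next step
        obs, _reward, done = env.step(action)
        history.append(action)
        if done:
            break
    return history
```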
Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text…
External link: http://arxiv.org/abs/2403.06952
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can… (a generic highlighting sketch follows the link below)
External link: http://arxiv.org/abs/2403.02325
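A generic example of the visual prompting described above: draw a marker on the region of interest before sending the image to a VLM. The file names and box coordinates are placeholders, and this is not the paper's specific method.

```python
from PIL import Image, ImageDraw

def highlight_region(image_path, box, color="red", width=4):
    """Draw a rectangle around a region of interest so a VLM attends to it."""
    img = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(img).rectangle(box, outline=color, width=width)
    return img

# Placeholder usage: mark a box, then pass the image to any VLM of choice.
highlighted = highlight_region("photo.jpg", box=(40, 60, 200, 220))
highlighted.save("photo_highlighted.jpg")
```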
Authors: Cho, Jaemin, Hu, Yushi, Garg, Roopal, Anderson, Peter, Krishna, Ranjay, Baldridge, Jason, Bansal, Mohit, Pont-Tuset, Jordi, Wang, Su
Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundation models to automatically generate a set… (a minimal QG/A sketch follows the link below)
External link: http://arxiv.org/abs/2310.18235
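To make the QG/A idea concrete, a minimal sketch of question generation and answering for text-image faithfulness. `generate_questions` and `vqa_answer` are hypothetical stand-ins for pre-trained models, and the yes-fraction score is one simple aggregation choice, not the paper's exact metric.

```python
def faithfulness_score(prompt, image, generate_questions, vqa_answer):
    """QG/A sketch: derive yes/no questions from the prompt, answer them on
    the image with a VQA model, and report the fraction answered "yes".

    Both callables are assumptions: generate_questions(prompt) -> list[str],
    vqa_answer(image, question) -> str.
    """
    questions = generate_questions(prompt)            # e.g., "Is there a dog?"
    answers = [vqa_answer(image, q) for q in questions]
    return sum(a.strip().lower() == "yes" for a in answers) / max(len(questions), 1)
```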
Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structured…
External link: http://arxiv.org/abs/2310.12128
Recent text-to-video (T2V) generation methods have seen significant advancements. However, the majority of these works focus on producing short video clips of a single event (i.e., single-scene videos). Meanwhile, recent large language models (LLMs)…
External link: http://arxiv.org/abs/2309.15091