Showing 1 - 10 of 116 results for search: "Cho, Jaemin"
The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Recent approaches using LLMs as annotators reduce human effort, but…
External link: http://arxiv.org/abs/2410.06215
Authors: Onoe, Yasumasa, Rane, Sunayana, Berger, Zachary, Bitton, Yonatan, Cho, Jaemin, Garg, Roopal, Ku, Alexander, Parekh, Zarana, Pont-Tuset, Jordi, Tanzer, Garrett, Wang, Su, Baldridge, Jason
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, …
External link: http://arxiv.org/abs/2404.19753
ControlNets are widely used for adding spatial control to text-to-image diffusion models with different conditions, such as depth maps, scribbles/sketches, and human poses. However, when it comes to controllable video generation, ControlNets cannot be… (a sketch of the standard ControlNet setup follows the link below)
External link: http://arxiv.org/abs/2404.09967
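For context, here is a minimal sketch of the standard single-image ControlNet setup this abstract refers to, using the Hugging Face diffusers library. The checkpoint IDs are common public ones and the depth map is a placeholder; this illustrates the baseline, not the paper's video method.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Attach a depth-conditioned ControlNet to a Stable Diffusion backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# A precomputed depth map acts as the spatial condition (placeholder path).
depth_map = load_image("depth_map.png")
image = pipe(
    "a cozy reading nook with a large window",
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("controlled_output.png")
```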
The goal of interactive image segmentation is to delineate specific regions within an image via visual or language prompts. Low-latency, high-quality interactive segmentation with diverse prompts remains challenging for existing specialist and generalist…
External link: http://arxiv.org/abs/2404.00741
Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance… (a generic agent-loop sketch follows the link below)
External link: http://arxiv.org/abs/2403.12014
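As a rough illustration of the LLM-as-agent pattern this abstract describes, here is a hedged sketch of the interaction loop. Both `llm` (any prompt-to-text callable) and `env` (a minimal reset()/step() interface) are hypothetical stand-ins, not a specific framework's API.

```python
def run_llm_agent(llm, env, max_steps=50):
    """Generic loop: the LLM reads the observation and picks the next action.

    Assumes env.step(action) returns (observation, reward, done); this
    interface is an assumption for illustration, not a real library.
    """
    obs = env.reset()
    history = []
    for _ in range(max_steps):
        prompt = (
            "You are an embodied agent.\n"
            f"Past actions: {history}\n"
            f"Current observation: {obs}\n"
            "Reply with the single next action."
        )
        action = llm(prompt)          # LLM decides the next step
        obs, _reward, done = env.step(action)
        history.append(action)
        if done:
            break
    return history
```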
Recent text-to-image (T2I) generation models have demonstrated impressive capabilities in creating images from text descriptions. However, these T2I generation models often fall short of generating images that precisely match the details of the text…
External link: http://arxiv.org/abs/2403.06952
Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can… (a generic highlighting sketch follows the link below)
External link: http://arxiv.org/abs/2403.02325
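A generic example of the visual prompting described above: draw a marker on the region of interest before sending the image to a VLM. The file names and box coordinates are placeholders, and this is not the paper's specific method.

```python
from PIL import Image, ImageDraw

def highlight_region(image_path, box, color="red", width=4):
    """Draw a rectangle around a region of interest so a VLM attends to it."""
    img = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(img).rectangle(box, outline=color, width=width)
    return img

# Placeholder usage: mark a box, then pass the image to any VLM of choice.
highlighted = highlight_region("photo.jpg", box=(40, 60, 200, 220))
highlighted.save("photo_highlighted.jpg")
```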
Authors: Cho, Jaemin, Hu, Yushi, Garg, Roopal, Anderson, Peter, Krishna, Ranjay, Baldridge, Jason, Bansal, Mohit, Pont-Tuset, Jordi, Wang, Su
Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundation models to automatically generate a set… (a minimal QG/A sketch follows the link below)
External link: http://arxiv.org/abs/2310.18235
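To make the QG/A idea concrete, a minimal sketch of question generation and answering for text-image faithfulness. `generate_questions` and `vqa_answer` are hypothetical stand-ins for pre-trained models, and the yes-fraction score is one simple aggregation choice, not the paper's exact metric.

```python
def faithfulness_score(prompt, image, generate_questions, vqa_answer):
    """QG/A sketch: derive yes/no questions from the prompt, answer them on
    the image with a VQA model, and report the fraction answered "yes".

    Both callables are assumptions: generate_questions(prompt) -> list[str],
    vqa_answer(image, question) -> str.
    """
    questions = generate_questions(prompt)            # e.g., "Is there a dog?"
    answers = [vqa_answer(image, q) for q in questions]
    return sum(a.strip().lower() == "yes" for a in answers) / max(len(questions), 1)
```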
Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. A diagram is a symbolic/schematic representation that explains information using structured…
External link: http://arxiv.org/abs/2310.12128
Recent text-to-video (T2V) generation methods have seen significant advancements. However, the majority of these works focus on producing short video clips of a single event (i.e., single-scene videos). Meanwhile, recent large language models (LLMs)…
External link: http://arxiv.org/abs/2309.15091