Showing 1 - 10 of 2,743 for search: '"Chen DongDong"'
We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks…
External link:
http://arxiv.org/abs/2412.09612
Author:
Zhang, Miaosen, Dai, Qi, Yang, Yifan, Bao, Jianmin, Chen, Dongdong, Qiu, Kai, Luo, Chong, Geng, Xin, Guo, Baining
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in the language part…
External link:
http://arxiv.org/abs/2412.04531
Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect…
External link:
http://arxiv.org/abs/2411.18301
Author:
Huang, Weiquan, Wu, Aoqi, Yang, Yifan, Luo, Xufang, Yang, Yuqing, Hu, Liang, Dai, Qi, Dai, Xiyang, Chen, Dongdong, Luo, Chong, Qiu, Lili
CLIP is a foundational multimodal model that aligns image and text features into a shared space using contrastive learning on large-scale image-text pairs. Its strength lies in leveraging natural language as a rich supervisory signal. With the rapid…
External link:
http://arxiv.org/abs/2411.04997
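The contrastive alignment described in the snippet above is conventionally formulated as a symmetric cross-entropy (InfoNCE-style) loss over cosine similarities of matched image-text pairs. The following is a minimal NumPy sketch of that objective, not the paper's or CLIP's actual implementation; the function name and the temperature value are assumptions for illustration.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    # L2-normalize so the dot product becomes cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # Pairwise similarity logits; row i vs column j, scaled by temperature.
    logits = img @ txt.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs lie on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # Negative log-probability of the correct (diagonal) match.
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned embeddings the loss approaches zero; for random embeddings it stays near log(batch_size), which is what makes it a useful training signal.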
Scene graphs offer a structured, hierarchical representation of images, with nodes and edges symbolizing objects and the relationships among them. They can serve as a natural interface for image editing, dramatically improving precision and flexibility…
External link:
http://arxiv.org/abs/2410.11815
Through the integration of external tools, large language models (LLMs) such as GPT-4o and Llama 3.1 significantly expand their functional capabilities, evolving from elementary conversational agents to general-purpose assistants. We argue that the…
External link:
http://arxiv.org/abs/2410.10872
With the release of GPT-4V(O), its use in generating pseudo labels for multi-modality tasks has gained significant popularity. However, it remains unclear how to build such advanced models from their base large language models (LLMs). This work explores…
External link:
http://arxiv.org/abs/2409.16517
Author:
Feng, Xuelu, Li, Yunsheng, Chen, Dongdong, Qiao, Chunming, Yuan, Junsong, Yuan, Lu, Hua, Gang
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. Unlike conventional SOD methods that produce a single segmentation mask for salient objects…
External link:
http://arxiv.org/abs/2409.02368
Automatic furniture layout is long desired for convenient interior design. Leveraging the remarkable visual reasoning capabilities of multimodal large language models (MLLMs), recent methods address layout generation in a static manner, lacking the…
External link:
http://arxiv.org/abs/2407.21333
Author:
Lin, Yuanze, Li, Yunsheng, Chen, Dongdong, Xu, Weijian, Clark, Ronald, Torr, Philip, Yuan, Lu
In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying…
External link:
http://arxiv.org/abs/2407.04681