Showing 1 - 10 of 56 for search: '"Wang, Yueze"'
Author:
Wang, Xinlong, Zhang, Xiaosong, Luo, Zhengxiong, Sun, Quan, Cui, Yufeng, Wang, Jinsheng, Zhang, Fan, Wang, Yueze, Li, Zhen, Yu, Qiying, Zhao, Yingli, Ao, Yulong, Min, Xuebin, Li, Tao, Wu, Boya, Zhao, Bo, Zhang, Bowen, Wang, Liangdong, Liu, Guang, He, Zheqi, Yang, Xi, Liu, Jingjing, Lin, Yonghua, Huang, Tiejun, Wang, Zhongyuan
While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches …
External link:
http://arxiv.org/abs/2409.18869
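As a gloss on "next-token prediction for multimodal tasks": a minimal sketch, assuming a PyTorch setup, of training a causal transformer over one shared vocabulary that covers both text tokens and discrete image tokens from a visual tokenizer. The vocabulary sizes, model dimensions, and toy model below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): next-token prediction over a single
# discrete vocabulary covering both text tokens and image tokens. Sizes are placeholders.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, DIM = 32000, 8192, 512
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # one shared codebook for both modalities

class TinyMultimodalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))  # causal mask
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

# Interleave text tokens and (offset) image tokens into one sequence, then train
# with the ordinary cross-entropy next-token objective.
tokens = torch.randint(0, VOCAB, (2, 16))
logits = TinyMultimodalLM()(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
)
```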
Author:
Xiao, Shitao, Wang, Yueze, Zhou, Junjie, Yuan, Huaying, Xing, Xingrun, Yan, Ruiran, Li, Chaofan, Wang, Shuting, Huang, Tiejun, Liu, Zheng
The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework …
External link:
http://arxiv.org/abs/2409.11340
Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception …
External link:
http://arxiv.org/abs/2407.08303
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual …
External link:
http://arxiv.org/abs/2406.11832
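For contrast with the encoder-free direction the snippet hints at, here is a minimal sketch of the conventional encoder-based VLM pipeline it describes: a vision backbone produces patch features, a projector maps them into the LLM's token-embedding space, and the result is concatenated with text embeddings. The Conv2d patch embedder and all dimensions are stand-in assumptions, not components of any particular model.

```python
# Minimal sketch of the conventional vision-encoder + projector + LLM pipeline.
# All module sizes are illustrative placeholders.
import torch
import torch.nn as nn

VIS_DIM, LLM_DIM = 768, 1024

vision_encoder = nn.Sequential(          # stand-in for e.g. a ViT backbone
    nn.Conv2d(3, VIS_DIM, kernel_size=16, stride=16),  # 16x16 patch embedding
    nn.Flatten(2),                       # (B, VIS_DIM, num_patches)
)
projector = nn.Linear(VIS_DIM, LLM_DIM)  # aligns visual features with LLM tokens
text_embed = nn.Embedding(32000, LLM_DIM)

image = torch.randn(1, 3, 224, 224)
patch_feats = vision_encoder(image).transpose(1, 2)         # (1, 196, VIS_DIM)
visual_tokens = projector(patch_feats)                      # (1, 196, LLM_DIM)
text_tokens = text_embed(torch.randint(0, 32000, (1, 8)))
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)  # sequence fed to the LLM
print(llm_input.shape)  # torch.Size([1, 204, 1024])
```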
Multimodal Large Language Models (MLLMs) have exhibited impressive capabilities in visual understanding and reasoning, providing seemingly reasonable answers, such as image descriptions. This has spurred extensive research on the evaluation of MLLMs. …
External link:
http://arxiv.org/abs/2406.10638
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting …
External link:
http://arxiv.org/abs/2402.11530
Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe inputs that lead them to generate unsafe content such as sexual, harassment, and illegal-activity images. Existing studies …
External link:
http://arxiv.org/abs/2402.10882
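To make "vulnerable to unsafe inputs" concrete, a toy sketch of the simplest input-side safeguard: screening a prompt against a blocklist before it reaches the T2I model. The blocklist terms and function name are hypothetical placeholders and do not represent the attacks or defenses studied in the paper.

```python
# Toy illustration (not the paper's method): block a prompt if it contains a
# blocked term. Real systems use learned classifiers; the word list is a placeholder.
from typing import Iterable

BLOCKLIST: tuple[str, ...] = ("nudity", "harassment", "illegal")  # placeholder terms

def is_prompt_safe(prompt: str, blocklist: Iterable[str] = BLOCKLIST) -> bool:
    """Return False if any blocked term appears in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return not any(term in lowered for term in blocklist)

if __name__ == "__main__":
    for p in ("a watercolor of a lighthouse", "an illegal weapons workshop"):
        print(p, "->", "allowed" if is_prompt_safe(p) else "blocked")
```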
Author:
Sun, Quan, Cui, Yufeng, Zhang, Xiaosong, Zhang, Fan, Yu, Qiying, Luo, Zhengxiong, Wang, Yueze, Rao, Yongming, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong
The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context …
External link:
http://arxiv.org/abs/2312.13286
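The sketch below illustrates what "a few demonstrations" can mean in a multimodal prompt: (image, answer) pairs interleaved into one sequence, ending with an unanswered query for the model to continue. The ImageRef type and prompt layout are assumptions for illustration only and do not reflect Emu2's actual input format.

```python
# Hypothetical multimodal in-context prompt: interleave a few (image, answer)
# demonstrations, then append the query image with an empty answer slot.
from dataclasses import dataclass
from typing import Union

@dataclass
class ImageRef:
    path: str  # placeholder: a real system would hold pixels or visual tokens

Segment = Union[str, ImageRef]

def build_icl_prompt(demos: list[tuple[str, str]], query_image: str) -> list[Segment]:
    """Interleave (image, answer) demonstrations and end with an unanswered query."""
    prompt: list[Segment] = []
    for image_path, answer in demos:
        prompt += [ImageRef(image_path), f"Answer: {answer}\n"]
    prompt += [ImageRef(query_image), "Answer:"]
    return prompt

demos = [("cat.jpg", "a cat"), ("dog.jpg", "a dog")]
print(build_icl_prompt(demos, "bird.jpg"))
```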
Author:
Sun, Quan, Yu, Qiying, Cui, Yufeng, Zhang, Fan, Zhang, Xiaosong, Wang, Yueze, Gao, Hongcheng, Liu, Jingjing, Huang, Tiejun, Wang, Xinlong
We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved …
External link:
http://arxiv.org/abs/2307.05222
Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization …
External link:
http://arxiv.org/abs/2306.04356
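As a reference point for the image-level zero-shot transfer mentioned above, a short sketch using the public CLIP checkpoint through Hugging Face Transformers; the checkpoint name, dummy image, and label set are example choices, and the snippet does not attempt the instance-level localization the paper targets.

```python
# Image-level zero-shot classification with CLIP via Hugging Face Transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # stand-in for a real photo
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarity -> class probabilities
print(dict(zip(labels, probs[0].tolist())))
```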