Showing 1 - 10
of 50
for search: '"Sun, Peize"'
Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, co…
External link:
http://arxiv.org/abs/2410.09347
Author:
Li, Zongming, Cheng, Tianheng, Chen, Shoufa, Sun, Peize, Shen, Haocheng, Ran, Longjin, Chen, Xiaoxin, Liu, Wenyu, Wang, Xinggang
Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains large…
External link:
http://arxiv.org/abs/2410.02705
Author:
Ji, Yatai, Zhang, Shilong, Wu, Jie, Sun, Peize, Chen, Weifeng, Xiao, Xuefeng, Yang, Sidi, Yang, Yujiu, Luo, Ping
The rapid advancement of Large Vision-Language models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenario, while their ability to associate instances across…
External link:
http://arxiv.org/abs/2407.07577
We introduce LlamaGen, a new family of image generation models that apply the original ``next-token prediction'' paradigm of large language models to the visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Ll…
External link:
http://arxiv.org/abs/2406.06525
Author:
Mu, Yao, Chen, Junting, Zhang, Qinglong, Chen, Shoufa, Yu, Qiaojun, Ge, Chongjian, Chen, Runjian, Liang, Zhixuan, Hu, Mengkang, Tao, Chaofan, Sun, Peize, Yu, Haibao, Yang, Chao, Shao, Wenqi, Wang, Wenhai, Dai, Jifeng, Qiao, Yu, Ding, Mingyu, Luo, Ping
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understa…
External link:
http://arxiv.org/abs/2402.16117
We present a conceptually simple, efficient, and general framework for localization problems in DETR-like models. We add plugins to well-trained models instead of inefficiently designing new models and training them from scratch. The method, called R…
External link:
http://arxiv.org/abs/2307.11828
Author:
Li, Feng, Zhang, Hao, Sun, Peize, Zou, Xueyan, Liu, Shilong, Yang, Jianwei, Li, Chunyuan, Zhang, Lei, Gao, Jianfeng
In this paper, we introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic…
External link:
http://arxiv.org/abs/2307.04767
Author:
Zhang, Shilong, Sun, Peize, Chen, Shoufa, Xiao, Min, Shao, Wenqi, Zhang, Wenwei, Liu, Yu, Chen, Kai, Luo, Ping
Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancements to fine-grained multimodal understanding. In this paper, …
External link:
http://arxiv.org/abs/2307.03601
Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions: object parts. In this paper, we propose a de…
External link:
http://arxiv.org/abs/2305.11173
Author:
Zhang, Yifu, Wang, Xinggang, Ye, Xiaoqing, Zhang, Wei, Lu, Jincheng, Tan, Xiao, Ding, Errui, Sun, Peize, Wang, Jingdong
Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects across video frames. Detection boxes serve as the basis of both 2D and 3D MOT. Inevitable changes in detection scores lead to objects being missed after tracking.
External link:
http://arxiv.org/abs/2303.15334