Showing 1 - 10 of 75 for search: '"Zhang, Yuechen"'
Diffusion models have demonstrated remarkable and robust abilities in both image and video generation. To achieve greater control over generated results, researchers introduce additional architectures, such as ControlNet, Adapters, and ReferenceNet…
External link:
http://arxiv.org/abs/2408.06070
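The entry above mentions control branches such as ControlNet that are attached to a diffusion backbone. As a minimal sketch of that general idea (not this paper's specific architecture), the PyTorch snippet below adds a zero-initialized, trainable control branch whose output is summed into a frozen block's features; all module and parameter names here are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """A frozen backbone block plus a zero-initialized control branch.

    Sketch of the ControlNet-style idea: the control branch ends in a
    zero-initialized conv, so at step 0 it contributes nothing and training
    starts exactly from the frozen model's behavior.
    """

    def __init__(self, channels: int, cond_channels: int):
        super().__init__()
        # Stand-in for a pretrained, frozen diffusion block (hypothetical).
        self.backbone = nn.Conv2d(channels, channels, 3, padding=1)
        for p in self.backbone.parameters():
            p.requires_grad_(False)

        # Trainable branch encoding the control signal (e.g. edges, pose).
        self.control = nn.Sequential(
            nn.Conv2d(cond_channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Zero-init the final conv so the branch is a no-op at initialization.
        nn.init.zeros_(self.control[-1].weight)
        nn.init.zeros_(self.control[-1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.backbone(x) + self.control(cond)

block = ControlledBlock(channels=64, cond_channels=3)
x = torch.randn(1, 64, 32, 32)    # latent features
cond = torch.randn(1, 3, 32, 32)  # spatially aligned control map
out = block(x, cond)              # equals backbone(x) at initialization
print(out.shape)                  # torch.Size([1, 64, 32, 32])
```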
Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in over-smoothed content, structural distortions, and repetitive patterns. To this end, we introduce ResMaster, a novel…
External link:
http://arxiv.org/abs/2406.16476
Author:
Li, Yanwei; Zhang, Yuechen; Wang, Chengyao; Zhong, Zhisheng; Chen, Yixin; Chu, Ruihang; Liu, Shaoteng; Jia, Jiaya
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to…
External link:
http://arxiv.org/abs/2403.18814
This study targets a critical aspect of multi-modal LLM (LLM & VLM) inference: explicit, controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation, yet bring less explainability and…
External link:
http://arxiv.org/abs/2312.04302
Author:
Xing, Jinbo; Xia, Menghan; Liu, Yuxin; Zhang, Yuechen; Zhang, Yong; He, Yingqing; Liu, Hanyuan; Chen, Haoxin; Cun, Xiaodong; Wang, Xintao; Shan, Ying; Wong, Tien-Tsin
Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying…
External link:
http://arxiv.org/abs/2306.00943
Recent diffusion model advancements have enabled high-fidelity images to be generated using text prompts. However, a domain gap exists between generated images and real-world images, which poses a challenge in generating high-quality variations of real…
External link:
http://arxiv.org/abs/2305.18729
This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, there are currently no large-scale video…
External link:
http://arxiv.org/abs/2303.04761
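Cross-attention control, named in the entry above, steers an edit by reusing the attention maps computed from the source prompt while attending to the edited prompt's values, preserving spatial layout. The snippet below is a generic, self-contained illustration of that idea in PyTorch, not Video-P2P's actual implementation; the dimensions and the toy single-head attention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cross_attention(q, k, v, probs_override=None):
    """Single-head cross-attention. If probs_override is given, those
    attention maps are reused instead of the ones computed from (q, k) --
    the core trick behind prompt-to-prompt style cross-attention control."""
    d = q.shape[-1]
    probs = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    used = probs_override if probs_override is not None else probs
    return used @ v, probs

# Toy setup (hypothetical sizes): 16 image tokens attending to 8 text tokens.
torch.manual_seed(0)
q = torch.randn(1, 16, 32)                                   # image queries
k_src, v_src = torch.randn(1, 8, 32), torch.randn(1, 8, 32)  # source prompt
k_tgt, v_tgt = torch.randn(1, 8, 32), torch.randn(1, 8, 32)  # edited prompt

# Pass 1: run with the source prompt and record its attention maps.
_, src_probs = cross_attention(q, k_src, v_src)

# Pass 2: run with the edited prompt, but inject the source attention maps
# so the spatial structure of the original content is preserved.
edited, _ = cross_attention(q, k_tgt, v_tgt, probs_override=src_probs)
print(edited.shape)  # torch.Size([1, 16, 32])
```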
Speech-driven 3D facial animation has been widely studied, yet a gap remains in achieving realism and vividness due to the highly ill-posed nature of the problem and the scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping…
External link:
http://arxiv.org/abs/2301.02379
Current 3D scene stylization methods transfer textures and colors as styles using arbitrary style references, lacking meaningful semantic correspondences. We introduce Reference-Based Non-Photorealistic Radiance Fields (Ref-NPR) to address this limitation…
External link:
http://arxiv.org/abs/2212.02766
Author:
Shen, Tiancheng; Zhang, Yuechen; Qi, Lu; Kuen, Jason; Xie, Xingyu; Wu, Jianlong; Lin, Zhe; Jia, Jiaya
Segmenting 4K or 6K ultra high-resolution images requires extra computational consideration. Common strategies, such as down-sampling, patch cropping, and cascade models, cannot adequately balance accuracy and computation…
External link:
http://arxiv.org/abs/2111.14482
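The last entry names patch cropping as one common strategy for ultra high-resolution segmentation. As a hedged illustration of that generic strategy (not this paper's method), the sketch below tiles a large image, runs a segmentation model per tile, and stitches the overlapping logits back together by averaging; the 1x1-conv "model" and all sizes are stand-ins.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def segment_by_patches(model: nn.Module, image: torch.Tensor,
                       patch: int = 512, stride: int = 256) -> torch.Tensor:
    """Sliding-window segmentation for images too large to process at once.

    image: (C, H, W). Runs `model` on overlapping `patch`-sized tiles and
    averages per-pixel logits where tiles overlap. Assumes H and W are at
    least `patch` (pad the image beforehand otherwise).
    """
    c, h, w = image.shape
    # Tile origins; force a final tile flush with each border.
    ys = sorted({*range(0, h - patch + 1, stride), h - patch})
    xs = sorted({*range(0, w - patch + 1, stride), w - patch})

    logits = None
    counts = torch.zeros(1, h, w)
    for y in ys:
        for x in xs:
            tile = image[:, y:y + patch, x:x + patch].unsqueeze(0)
            out = model(tile).squeeze(0)  # (num_classes, patch, patch)
            if logits is None:
                logits = torch.zeros(out.shape[0], h, w)
            logits[:, y:y + patch, x:x + patch] += out
            counts[:, y:y + patch, x:x + patch] += 1
    return logits / counts

# Usage with a stand-in "model": a 1x1 conv mapping 3 channels to 5 classes.
model = nn.Conv2d(3, 5, kernel_size=1)
image = torch.randn(3, 1024, 1536)      # a large input image
seg = segment_by_patches(model, image)  # (5, 1024, 1536) averaged logits
pred = seg.argmax(dim=0)                # per-pixel class map
print(pred.shape)                       # torch.Size([1024, 1536])
```

Averaging overlapping logits (rather than hard-stitching disjoint tiles) is a common way to suppress seam artifacts at patch borders, at the cost of running more tiles per image.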