Showing 1 - 10 of 477 results for search: '"Liu, Luping"'
The rapid advancement of text-to-image (T2I) diffusion models has enabled them to generate unprecedented results from given texts. However, as text inputs become longer, existing encoding methods like CLIP face limitations, and aligning the generated …
External link:
http://arxiv.org/abs/2410.11817
Author:
Wang, Zehan, Zhang, Ziang, Zhang, Hang, Liu, Luping, Huang, Rongjie, Cheng, Xize, Zhao, Hengshuang, Zhao, Zhou
Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representation …
External link:
http://arxiv.org/abs/2407.11895
Author:
Wang, Zehan, Zhang, Ziang, Cheng, Xize, Huang, Rongjie, Liu, Luping, Ye, Zhenhui, Huang, Haifeng, Zhao, Yang, Jin, Tao, Gao, Peng, Zhao, Zhou
Unified multi-model representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and catastrophic forgetting problems make it challenging to further enhance pre-trained unified spaces.
External link:
http://arxiv.org/abs/2405.04883
Author:
Huang, Haifeng, Chen, Yilun, Wang, Zehan, Huang, Rongjie, Xu, Runsen, Wang, Tai, Liu, Luping, Cheng, Xize, Zhao, Yang, Pang, Jiangmiao, Zhao, Zhou
Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in general referencing and grounding capabilities for intricate scene comprehension …
External link:
http://arxiv.org/abs/2312.08168
Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, the high dependence on large-scale, high-quality paired data and the expensive t…
External link:
http://arxiv.org/abs/2310.08884
Diffusion models have demonstrated impressive performance in text-to-image generation. They utilize a text encoder and cross-attention blocks to infuse textual information into images at a pixel level. However, their capability to generate images with …
External link:
http://arxiv.org/abs/2306.02236
Author:
Huang, Rongjie, Zhang, Chunlei, Wang, Yongqi, Yang, Dongchao, Liu, Luping, Ye, Zhenhui, Jiang, Ziyue, Weng, Chao, Zhao, Zhou, Yu, Dong
Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to …
External link:
http://arxiv.org/abs/2305.19269
Diffusion models have recently dominated image synthesis tasks. However, the iterative denoising process is expensive in computations at inference time, making diffusion models less practical for low-latency and scalable real-world applications. Post…
External link:
http://arxiv.org/abs/2305.10657
Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs) …
External link:
http://arxiv.org/abs/2301.12935
Author:
Huang, Rongjie, Huang, Jiawei, Yang, Dongchao, Ren, Yi, Liu, Luping, Li, Mingze, Ye, Zhenhui, Liu, Jinglin, Yin, Xiang, Zhao, Zhou
Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and …
External link:
http://arxiv.org/abs/2301.12661