Showing 1 - 10 of 21 results for search: '"Qu, Leigang"'
Author:
Qu, Leigang, Li, Haochuan, Wang, Wenjie, Liu, Xiang, Li, Juncheng, Nie, Liqiang, Chua, Tat-Seng
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, pushing forward advancements in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in … (a generic alignment-scoring sketch follows this entry)
External link:
http://arxiv.org/abs/2412.05818
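The snippet above concerns measuring and improving text-image alignment. Below is a minimal sketch of the generic alignment score such work typically builds on, using Hugging Face's CLIP; this is an assumed baseline metric, not the paper's method.

```python
# Hedged sketch: scoring text-image alignment with CLIP cosine
# similarity. This is NOT the paper's method; it only illustrates the
# generic alignment signal that work in this area commonly builds on.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```

Higher scores indicate closer text-image agreement; methods in this line can then optimize generation against such a signal.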
Author:
Li, Yongqi, Cai, Hongru, Wang, Wenjie, Qu, Leigang, Wei, Yinwei, Li, Wenjie, Nie, Liqiang, Chua, Tat-Seng
Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via … (a dual-encoder sketch follows this entry)
External link:
http://arxiv.org/abs/2407.17274
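The "discriminative problem" this entry mentions is conventionally a dual encoder that embeds both modalities into a shared space and ranks candidates by similarity. A minimal sketch follows, with random projections standing in for the real text and image encoders; nothing here is the paper's model.

```python
# Hedged sketch of the classic discriminative formulation: a dual
# encoder embeds the query and every candidate image, then ranks the
# gallery by cosine similarity. The "encoders" are random projections,
# purely as stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def embed(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    z = x @ proj
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

text_proj = rng.normal(size=(64, 128))   # placeholder text encoder
image_proj = rng.normal(size=(64, 128))  # placeholder image encoder

query = embed(rng.normal(size=(1, 64)), text_proj)
gallery = embed(rng.normal(size=(1000, 64)), image_proj)

scores = (gallery @ query.T).ravel()     # cosine similarities
top10 = np.argsort(-scores)[:10]         # indices of best-matching images
print(top10, scores[top10])
```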
How humans can efficiently and effectively acquire images has always been a perennial question. A typical solution is text-to-image retrieval from an existing database given the text query; however, the limited database typically lacks creativity. By … (a retrieve-or-generate sketch follows this entry)
External link:
http://arxiv.org/abs/2406.05814
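One way to read the entry above is a system that can both retrieve from a fixed database and generate when the database falls short. Here is a purely hypothetical sketch, assuming `retrieve`, `generate`, and `align` callables that a real system would implement with actual models; none of these names come from the paper.

```python
# Hedged sketch: one simple way to unify retrieval and generation is
# to produce a candidate from each route and keep whichever aligns
# better with the query. All three callables are hypothetical
# stand-ins, not the paper's components.
def best_image(query, retrieve, generate, align):
    retrieved = retrieve(query)    # best hit from the fixed database
    generated = generate(query)    # freshly synthesized image
    return max((retrieved, generated), key=lambda img: align(img, query))

# Toy usage with placeholder callables:
pick = best_image(
    "a cat astronaut",
    retrieve=lambda q: "db_image_17.png",
    generate=lambda q: "generated.png",
    align=lambda img, q: len(img),  # dummy score; a real system might use CLIP
)
print(pick)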
Author:
Nguyen, Thong, Bin, Yi, Xiao, Junbin, Qu, Leigang, Li, Yicong, Wu, Jay Zhangjie, Nguyen, Cong-Duy, Ng, See-Kiong, Tuan, Luu Anh
Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video …
External link:
http://arxiv.org/abs/2406.05615
Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding … (a toy attention-manipulation sketch follows this entry)
External link:
http://arxiv.org/abs/2403.04321
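Cross-attention manipulation, as mentioned in this entry, generally means reweighting how image regions attend to prompt tokens. Below is a toy numpy sketch that boosts attention toward chosen tokens (e.g., a neglected object word) before the softmax; the scaling scheme is illustrative only, not the paper's.

```python
# Hedged toy sketch of cross-attention manipulation: upweight the
# attention image patches pay to selected prompt tokens before the
# softmax. Illustrative only.
import numpy as np

def manipulated_cross_attention(q, k, v, boost_tokens, scale=2.0):
    """q: (patches, d); k, v: (tokens, d); boost_tokens: token indices."""
    logits = q @ k.T / np.sqrt(q.shape[-1])   # (patches, tokens)
    logits[:, boost_tokens] += np.log(scale)  # boost the chosen tokens
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v

rng = np.random.default_rng(1)
q = rng.normal(size=(16, 8))   # 16 image patches
k = rng.normal(size=(5, 8))    # 5 prompt tokens
v = rng.normal(size=(5, 8))
out = manipulated_cross_attention(q, k, v, boost_tokens=[2])
print(out.shape)
```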
The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable multimodal large … (a constrained-decoding sketch follows this entry)
External link:
http://arxiv.org/abs/2402.10805
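Generative retrieval of the kind gestured at in this entry typically has a model emit a document identifier token by token, with decoding constrained so only valid identifiers can be produced. A minimal sketch follows, assuming a trie over hypothetical identifiers and a toy scoring function in place of a real language model.

```python
# Hedged sketch of generative retrieval: decoding walks a trie of
# valid document identifiers, so the model can only "recall" documents
# that actually exist. The scoring function and identifiers below are
# toys, not the paper's design.
def build_trie(ids):
    trie = {}
    for doc in ids:
        node = trie
        for tok in doc:
            node = node.setdefault(tok, {})
    return trie

def decode(score, trie):
    """Greedy constrained decoding: always pick the best *valid* token."""
    node, out = trie, []
    while node:
        tok = max(node, key=score)  # best allowed continuation
        out.append(tok)
        node = node[tok]
    return tuple(out)

ids = [("img", "042"), ("img", "317"), ("txt", "009")]
docid = decode(lambda t: len(t), build_trie(ids))  # toy scoring function
print(docid)
```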
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always …
External link:
http://arxiv.org/abs/2309.05519
In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation …) (a toy layout-planning sketch follows this entry)
External link:
http://arxiv.org/abs/2308.05095
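Spatial-relation misalignment in diffusion models is often addressed by planning a coarse layout first and conditioning generation on it. The toy planner below maps a relation to bounding boxes on a unit canvas; it is purely illustrative and not the paper's pipeline.

```python
# Hedged sketch: a toy layout planner that turns an
# "X <relation> Y" statement into two bounding boxes on a unit
# canvas. A layout-conditioned generator would consume these boxes;
# this is illustrative only.
def plan_layout(subject, relation, obj):
    if relation == "left of":
        return {subject: (0.05, 0.3, 0.45, 0.7),  # (x0, y0, x1, y1)
                obj:     (0.55, 0.3, 0.95, 0.7)}
    if relation == "above":
        return {subject: (0.3, 0.05, 0.7, 0.45),
                obj:     (0.3, 0.55, 0.7, 0.95)}
    raise ValueError(f"unsupported relation: {relation}")

print(plan_layout("cat", "left of", "dog"))
```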
Published in:
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023)
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on the pairwise relations (i.e., whether a data sample matches another) but ignores the higher-order neighbor … (a neighbor-aware re-ranking sketch follows this entry)
External link:
http://arxiv.org/abs/2304.12570
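Higher-order neighbor information, as opposed to pairwise scores alone, can be folded in at re-ranking time. The generic numpy sketch below blends the raw query-candidate score with each candidate's similarity to the query's top-k hits; this is a simple baseline for the idea, not necessarily the paper's method.

```python
# Hedged sketch of neighbor-aware re-ranking: a candidate's final
# score mixes its direct query similarity with its consensus among the
# query's top-k neighbors, a simple form of higher-order relation.
import numpy as np

def rerank(scores, sim, k=5, alpha=0.5):
    """scores: (N,) query-candidate; sim: (N, N) candidate-candidate."""
    topk = np.argsort(-scores)[:k]               # the query's top-k hits
    neighbor_score = sim[:, topk].mean(axis=1)   # consensus with those hits
    return alpha * scores + (1 - alpha) * neighbor_score

rng = np.random.default_rng(2)
gallery = rng.normal(size=(100, 32))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = rng.normal(size=32)
query /= np.linalg.norm(query)

final = rerank(gallery @ query, gallery @ gallery.T)
print(np.argsort(-final)[:10])                   # re-ranked top-10 indices
```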
We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. However, existing methods merely focus on the latter, i.e., fine-grained search, by harnessing … (a query-fusion sketch follows this entry)
External link:
http://arxiv.org/abs/2211.07394
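Composed image retrieval combines a reference image with text feedback into one query. Below is a minimal sketch using additive embedding fusion with random stand-in encoders; real systems learn the fusion, and nothing here reflects the paper's coarse-to-fine model.

```python
# Hedged sketch of composed image retrieval: fuse the reference-image
# embedding with the text-feedback embedding, then rank the gallery by
# the fused query. Additive fusion is a simple baseline; embeddings
# are random stand-ins.
import numpy as np

rng = np.random.default_rng(3)
d = 64
ref_image = rng.normal(size=d)   # stand-in embedding of the reference image
feedback = rng.normal(size=d)    # stand-in embedding of the text feedback
gallery = rng.normal(size=(500, d))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

query = ref_image + feedback               # simple additive fusion
query /= np.linalg.norm(query)

ranking = np.argsort(-(gallery @ query))   # best matches first
print(ranking[:5])
```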