Showing 1 - 10 of 180 for search: '"Lim, Ser-Nam"'
We study the visual semantic embedding problem for image-text matching. Most existing work utilizes a tailored cross-attention mechanism to perform local alignment across the two image and text modalities. This is computationally expensive, even thou
External link:
http://arxiv.org/abs/2406.11820
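For readers unfamiliar with the mechanism this snippet mentions, cross-attention local alignment can be sketched roughly as follows (pure Python, illustrative names only; not any particular paper's implementation):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention_score(regions, tokens):
    """Average cosine similarity between each text token and its
    attention-weighted image context (a local-alignment sketch)."""
    total = 0.0
    for t in tokens:
        # attend over image regions with this token as the query
        weights = softmax([dot(t, r) for r in regions])
        context = [sum(w * r[i] for w, r in zip(weights, regions))
                   for i in range(len(t))]
        denom = math.sqrt(dot(t, t) * dot(context, context))
        total += dot(t, context) / denom if denom else 0.0
    return total / len(tokens)
```

Note that scoring a single image-text pair already costs O(regions × tokens), and every candidate pair needs its own attention pass, which illustrates the computational expense the abstract alludes to.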
Author:
Huang, Shuaiyi, Suri, Saksham, Gupta, Kamal, Rambhatla, Sai Saketh, Lim, Ser-nam, Shrivastava, Abhinav
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation (
External link:
http://arxiv.org/abs/2406.06908
"Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely applied in various applications, such as image-text cross-modal search. In this paper, we explore the potential of enh
External link:
http://arxiv.org/abs/2405.14726
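The retrieval setup behind "learning to hash" can be sketched minimally: real-valued embeddings are binarized into short codes and ranked by Hamming distance (a generic sign-threshold sketch, not the paper's method):

```python
def binary_hash(vec):
    """Sign-threshold a real-valued embedding into a bit tuple."""
    return tuple(1 if x >= 0 else 0 for x in vec)

def hamming(a, b):
    """Number of differing bits between two equal-length codes."""
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_vec, db_vecs, k=2):
    """Return the indices of the k database items whose binary
    codes are closest to the query's code."""
    q = binary_hash(query_vec)
    ranked = sorted((hamming(q, binary_hash(v)), i)
                    for i, v in enumerate(db_vecs))
    return [i for _, i in ranked[:k]]
```

The appeal is that bit codes are tiny to store and Hamming distance reduces to XOR + popcount, which is where the fast search speed and low storage cost come from.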
Author:
Jang, Young Kyun, Lim, Ser-nam
Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the e
External link:
http://arxiv.org/abs/2405.14715
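To make the backfilling problem concrete: one generic way to avoid re-computing a gallery is to map new-model queries into the old embedding space and search the old index directly. The projection below is a hypothetical stand-in for a learned compatibility map, not the paper's approach:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    """Apply a row-major matrix M to vector v."""
    return [dot(row, v) for row in M]

def search_old_gallery(old_gallery, new_query, projection):
    """Map a new-model query into the old embedding space, then run
    nearest-neighbour search over the existing (old) embeddings --
    no backfilling of the gallery required."""
    q = matvec(projection, new_query)
    return max(range(len(old_gallery)),
               key=lambda i: dot(old_gallery[i], q))
```

Without such a compatibility map, every stored embedding would have to be re-computed with the new model before the index could serve new-model queries, which is the costly backfilling step the abstract describes.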
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but the
External link:
http://arxiv.org/abs/2405.00571
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target ima
External link:
http://arxiv.org/abs/2404.15516
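The Composed Image Retrieval query structure (reference image + modification text) can be illustrated with a simple additive late-fusion baseline. This is an assumed baseline for illustration only, not the method of either CIR paper above:

```python
import math

def compose(image_emb, text_emb, alpha=0.5):
    """Blend the reference-image embedding with the modification-text
    embedding into a single composed query (additive baseline)."""
    return [alpha * i + (1 - alpha) * t
            for i, t in zip(image_emb, text_emb)]

def nearest(query, gallery):
    """Index of the gallery item most cosine-similar to the query."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a))
               * math.sqrt(sum(y * y for y in b)))
        return num / den if den else 0.0
    return max(range(len(gallery)), key=lambda i: cos(query, gallery[i]))
```

The supervised approaches mentioned above instead learn the composition function from labeled (reference, text, target) triplets, which is exactly the annotation cost that motivates the unsupervised variants.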
Author:
He, Bo, Li, Hengduo, Jang, Young Kyun, Jia, Menglin, Cao, Xuefei, Shah, Ashish, Shrivastava, Abhinav, Lim, Ser-Nam
With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoC
External link:
http://arxiv.org/abs/2404.05726
Mitigating hallucinations of Large Vision Language Models (LVLMs) is crucial to enhancing their reliability as general-purpose assistants. This paper shows that such hallucinations of LVLMs can be significantly exacerbated by preceding user-system dia
External link:
http://arxiv.org/abs/2403.10492
Novel view synthesis has seen tremendous development since the arrival of NeRFs. However, NeRF models overfit to a single scene, lacking generalization to out-of-distribution objects. Recently, diffusion models have exhibited remarkable performa
External link:
http://arxiv.org/abs/2403.06394
Author:
Chiang, Ping-yeh, Zhou, Yipin, Poursaeed, Omid, Shukla, Satya Narayan, Shah, Ashish, Goldstein, Tom, Lim, Ser-Nam
Recently, Pyramid Adversarial training (Herrmann et al., 2022) has been shown to be very effective for improving clean accuracy and distribution-shift robustness of vision transformers. However, due to the iterative nature of adversarial training, th
External link:
http://arxiv.org/abs/2312.16339