Showing 1 - 10 of 362 for search: '"Sun, Xiaoshuai"'
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. Howev
External link:
http://arxiv.org/abs/2406.16449
Author:
Qian, Zhipeng, Zhang, Pei, Yang, Baosong, Fan, Kai, Ma, Yiwei, Wong, Derek F., Sun, Xiaoshuai, Ji, Rongrong
This paper introduces AnyTrans, an all-encompassing framework for the task of Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models,
External link:
http://arxiv.org/abs/2406.11432
Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description. Despite considerable efforts to bridge the gap between vision and language, the significant differences between the
External link:
http://arxiv.org/abs/2406.05620
In this paper, we introduce SemiRES, a semi-supervised framework that effectively leverages a combination of labeled and unlabeled data to perform RES. A significant hurdle in applying semi-supervised techniques to RES is the prevalence of noisy pseu
External link:
http://arxiv.org/abs/2406.01451
This paper explores a novel dynamic network for vision and language tasks, where the inferring structure is customized on the fly for different inputs. Most previous state-of-the-art approaches are static and hand-crafted networks, which not only hea
External link:
http://arxiv.org/abs/2406.00334
Recent advancements in automatic 3D avatar generation guided by text have made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progress
External link:
http://arxiv.org/abs/2405.00954
Recently, Segment Anything Model (SAM) has become a research hotspot in the fields of multimedia and computer vision, which exhibits powerful yet versatile capabilities on various (un)conditional image segmentation tasks. Although SAM can support di
External link:
http://arxiv.org/abs/2404.00650
Author:
Chen, Zhongxi, Sun, Ke, Zhou, Ziyin, Lin, Xianming, Sun, Xiaoshuai, Cao, Liujuan, Ji, Rongrong
The rapid progress in deep learning has given rise to hyper-realistic facial forgery methods, leading to concerns related to misinformation and security risks. Existing face forgery datasets have limitations in generating high-quality facial images a
External link:
http://arxiv.org/abs/2403.18471
In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main comp
External link:
http://arxiv.org/abs/2403.15226
Author:
Zhang, Jinlu, Zhou, Yiyi, Zheng, Qiancheng, Du, Xiaoxiong, Luo, Gen, Peng, Jun, Sun, Xiaoshuai, Ji, Rongrong
Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hotspot in machine learning, which still suffers from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for f
External link:
http://arxiv.org/abs/2403.06702