Showing 1 - 10 of 34 results for search: "Zeng, Ziyun"

Author: Hua, Hang, Tang, Yunlong, Zeng, Ziyun, Cao, Liangliang, Yang, Zhengyuan, He, Hangfeng, Xu, Chenliang, Luo, Jiebo
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, …
External link: http://arxiv.org/abs/2410.09733

Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks, allowing image processing to adhere to human instructions. However, the lack of diverse instruction-following data hampers the development …
External link: http://arxiv.org/abs/2405.16785

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval
Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods … (see the sketch after this entry)
External link: http://arxiv.org/abs/2310.05195

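As a rough illustration of the clip-modeling idea named in this abstract, the Python sketch below scores an untrimmed video by the best cosine match between a text embedding and its clip embeddings, which captures "partial" relevance. The function name prvr_score, the embedding size, and the max-pooling choice are assumptions for illustration only; this is not GMMFormer's actual model.

import numpy as np

def prvr_score(text_emb, clip_embs):
    # Score one video: max cosine similarity between the query
    # and any of the video's clip embeddings (partial relevance).
    text = text_emb / np.linalg.norm(text_emb)
    clips = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    return float(np.max(clips @ text))

rng = np.random.default_rng(0)
query = rng.normal(size=128)                              # one text-query embedding (toy)
videos = [rng.normal(size=(n, 128)) for n in (4, 8, 6)]   # clip embeddings per untrimmed video
ranked = sorted(range(len(videos)), key=lambda i: -prvr_score(query, videos[i]))
print("videos ranked by best-matching clip:", ranked)
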
The great success of Large Language Models (LLMs) has expanded the potential of multimodality, contributing to the gradual evolution of General Artificial Intelligence (AGI). A true AGI agent should not only possess the capability to perform predefined …
External link: http://arxiv.org/abs/2310.01218

Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation …
External link: http://arxiv.org/abs/2308.14710

Author: Wang, Jinpeng, Zeng, Ziyun, Wang, Yunxiao, Wang, Yuting, Lu, Xingyu, Li, Tianxiang, Yuan, Jun, Zhang, Rui, Zheng, Hai-Tao, Xia, Shu-Tao
The goal of sequential recommendation (SR) is to predict a user's potential interested items based on her/his historical interaction sequences. Most existing sequential recommenders are developed based on ID features, which, despite their widespread … (see the sketch after this entry)
External link: http://arxiv.org/abs/2308.11175

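A minimal sketch of the ID-feature approach this abstract refers to, assuming a plain item-embedding table and an average-pooled history encoder (real sequential recommenders use learned sequence models such as Transformers). The name recommend, the table sizes, and the pooling choice are hypothetical, for illustration only.

import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 1000, 64
item_emb = rng.normal(scale=0.1, size=(num_items, dim))  # ID-based embedding table (toy)

def recommend(history, k=5):
    # Encode the interaction sequence (crudely, by averaging) and
    # score every candidate item by dot product with the user vector.
    user_vec = item_emb[history].mean(axis=0)
    scores = item_emb @ user_vec
    scores[history] = -np.inf                # never re-recommend seen items
    return np.argsort(-scores)[:k]

print(recommend([3, 17, 256, 42]))           # top-5 next-item candidates
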
We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual … (see the sketch after this entry)
External link: http://arxiv.org/abs/2307.08041

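To make the phrase "quantized visual tokens" concrete, the sketch below shows the generic vector-quantization lookup such tokenizers build on: each continuous patch feature is snapped to its nearest codebook entry, and the resulting integer indices are the discrete tokens an LLM can consume. Codebook and feature sizes are assumptions; this is not SEED's actual tokenizer.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 32))   # 512 learnable code vectors (toy)
patches = rng.normal(size=(16, 32))     # features for 16 image patches (toy)

# Nearest-neighbour lookup: squared distance to every code, one token per patch.
d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = d2.argmin(axis=1)
print("visual tokens:", tokens)
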
The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it is …
External link: http://arxiv.org/abs/2305.14173

Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, facilitating large-scale video retrieval efficiency and attracting increasing research attention. The success of SSVH … (see the sketch after this entry)
External link: http://arxiv.org/abs/2211.11210

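A toy sketch of why short binary codes aid retrieval efficiency, as the abstract claims: real-valued embeddings binarized with sign() can be ranked by Hamming distance, which reduces to cheap bitwise arithmetic at scale. The random vectors below stand in for a learned SSVH encoder and are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
db_codes = np.sign(rng.normal(size=(10_000, 64)))  # database of 64-bit video codes (toy)
query = np.sign(rng.normal(size=64))               # binarized query-video code (toy)

# Hamming distance: count of differing bits against every stored code.
hamming = (db_codes != query).sum(axis=1)
print("top-5 nearest videos:", np.argsort(hamming)[:5])
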
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit …
External link: http://arxiv.org/abs/2209.15280