Showing 1 - 10 of 138 for search: '"Yao, Linli"'
Author:
Li, Lei, Liu, Yuanxin, Yao, Linli, Zhang, Peiyuan, An, Chenxin, Wang, Lean, Sun, Xu, Kong, Lingpeng, Liu, Qi
Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the …
External link:
http://arxiv.org/abs/2410.06166
Published in:
Proceedings of the 2024 International Conference on Multimedia Retrieval, May 2024, Pages 1034-1042
With the surge in the amount of video data, video summarization techniques, including visual-modal (VM) and textual-modal (TM) summarization, are attracting more and more attention. However, unimodal summarization inevitably loses the rich semantics of …
External link:
http://arxiv.org/abs/2406.16301
The visual projector, which bridges the vision and language modalities and facilitates cross-modal alignment, serves as a crucial component in MLLMs. However, measuring the effectiveness of projectors in vision-language alignment remains under-explored …
External link:
http://arxiv.org/abs/2405.20985
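A common baseline for such a projector is a small MLP that maps frozen vision-encoder patch tokens into the LLM embedding space. The sketch below is illustrative only; the class name MLPProjector, the dimensions, and the two-layer GELU design are assumptions, not the configuration studied in the paper.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Minimal two-layer MLP projector mapping vision features to the LLM embedding space.

    Illustrative sketch only: dimensions and depth are assumptions, not the paper's setup.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(vision_features)  # (batch, num_patches, llm_dim)


if __name__ == "__main__":
    projector = MLPProjector()
    patches = torch.randn(2, 576, 1024)  # e.g. patch tokens from a ViT-style encoder
    print(projector(patches).shape)      # torch.Size([2, 576, 4096])
```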
Author:
Wang, Yuchi, Ren, Shuhuai, Gao, Rundong, Yao, Linli, Guo, Qingyan, An, Kaikai, Bai, Jianhong, Sun, Xu
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability …
External link:
http://arxiv.org/abs/2404.10763
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content …
External link:
http://arxiv.org/abs/2312.02051
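As a rough illustration of what a timestamp-aware frame encoder might look like, the sketch below adds an embedding of each frame's timestamp to that frame's visual features before a self-attention layer. This is not TimeChat's actual implementation; the class name TimestampAwareFrameEncoder, the MLP timestamp embedding, the dimensions, and the fusion scheme are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class TimestampAwareFrameEncoder(nn.Module):
    """Hypothetical sketch: fuse each frame's visual features with an embedding of its timestamp.

    Not the TimeChat implementation; all names, dimensions, and the fusion scheme are assumed.
    """
    def __init__(self, feat_dim: int = 768, max_seconds: float = 3600.0):
        super().__init__()
        self.max_seconds = max_seconds
        self.time_mlp = nn.Sequential(          # embed a scalar timestamp into feature space
            nn.Linear(1, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.fuse = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim); timestamps: (batch, num_frames) in seconds
        t = (timestamps / self.max_seconds).unsqueeze(-1)  # normalize to roughly [0, 1]
        time_emb = self.time_mlp(t)                        # (batch, num_frames, feat_dim)
        return self.fuse(frame_feats + time_emb)           # timestamp-conditioned frame features


if __name__ == "__main__":
    enc = TimestampAwareFrameEncoder()
    feats = torch.randn(1, 96, 768)
    ts = torch.linspace(0, 600, 96).unsqueeze(0)  # frames sampled over a 10-minute video
    print(enc(feats, ts).shape)                   # torch.Size([1, 96, 768])
```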
Author:
Yao, Linli, Zhang, Yuanmeng, Wang, Ziheng, Hou, Xinglin, Ge, Tiezheng, Jiang, Yuning, Sun, Xu, Jin, Qin
Automatically narrating videos in natural language complying with user requests, i.e. the Controllable Video Captioning task, can help people manage massive videos with desired intentions. However, existing works suffer from two shortcomings: 1) the cont…
External link:
http://arxiv.org/abs/2305.08389
Image-text retrieval, as a fundamental and important branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching. Some recent works focus more on fin…
External link:
http://arxiv.org/abs/2304.10824
Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g. multimodal retrieval and recommendation. However, existing models suffer from the problem of generating ``over-…
External link:
http://arxiv.org/abs/2211.09371
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require learning strong…
External link:
http://arxiv.org/abs/2202.04298
Author:
Duan, Zonghao, Yang, Minwei, Yang, Jian, Wu, Zheng, Zhu, Yuheng, Jia, Qinyuan, Ma, Xueshiyu, Yin, Yifan, Zheng, Jiahao, Yang, Jianyu, Jiang, Shuheng, Hu, Lipeng, Zhang, Junfeng, Liu, Dejun, Huo, Yanmiao, Yao, Linli, Sun, Yongwei
Published in:
Cancer Letters, Volume 598, 28 August 2024