Showing 1 - 10 of 44 for the search: '"Guo, Longteng"'
Author:
Zhao, Zijia, Guo, Longteng, Yue, Tongtian, Hu, Erdong, Shao, Shuai, Yuan, Zehuan, Huang, Hua, Liu, Jing
In this paper, we investigate the task of general conversational image retrieval on open-domain images. The objective is to search for images based on interactive conversations between humans and computers. To advance this task, we curate a dataset…
External link:
http://arxiv.org/abs/2410.18715
Electroencephalogram (EEG) signals are pivotal in providing insights into spontaneous brain activity, highlighting their significant importance in neuroscience research. However, the exploration of versatile EEG models is constrained by diverse data…
External link:
http://arxiv.org/abs/2410.19779
In the era of Large Language Models (LLMs), Mixture-of-Experts (MoE) architectures offer a promising approach to managing computational costs while scaling up model parameters. Conventional MoE-based LLMs typically employ static Top-K routing, which…
External link:
http://arxiv.org/abs/2410.10456
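To make the "static Top-K routing" the abstract mentions concrete, here is a minimal PyTorch sketch of a conventional Top-K router. All names, shapes, and constants are illustrative assumptions, not the paper's implementation; the point is only that k is fixed for every token.

```python
import torch
import torch.nn.functional as F

def static_top_k_route(tokens, gate_weight, k=2):
    """Static Top-K routing: every token is dispatched to exactly k experts,
    regardless of how easy or hard the token is (hypothetical minimal version)."""
    logits = tokens @ gate_weight                   # (n_tokens, n_experts)
    probs = F.softmax(logits, dim=-1)
    gate_vals, expert_ids = probs.topk(k, dim=-1)   # always the same k per token
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)  # renormalize
    return gate_vals, expert_ids

# Usage: 4 tokens, hidden size 8, 4 experts, K = 2
tokens = torch.randn(4, 8)
gate_weight = torch.randn(8, 4)
vals, ids = static_top_k_route(tokens, gate_weight)
print(ids)  # each row lists the 2 experts that token is sent to
```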
Autor:
Sun, Mingzhen, Wang, Weining, Qiao, Yanyuan, Sun, Jiahui, Qin, Zihan, Guo, Longteng, Zhu, Xinxin, Liu, Jing
Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent…
External link:
http://arxiv.org/abs/2410.01594
In computer vision, Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images. Traditional IDC methods often rely on specialist models, which restrict their applicability across varied contexts.
External link:
http://arxiv.org/abs/2407.05645
Author:
Zhao, Zijia, Lu, Haoyu, Huo, Yuqi, Du, Yifan, Yue, Tongtian, Guo, Longteng, Wang, Bingning, Chen, Weipeng, Liu, Jing
Video understanding is a crucial next step for multimodal large language models (MLLMs). Various benchmarks have been introduced to better evaluate MLLMs. Nevertheless, current video benchmarks are still inefficient for evaluating video models during…
External link:
http://arxiv.org/abs/2406.09367
While large visual-language models (LVLMs) have shown promising results on traditional visual question answering benchmarks, it is still challenging for them to answer complex VQA problems which require diverse world knowledge. Motivated by the research…
External link:
http://arxiv.org/abs/2404.13947
Author:
Qiao, Yanyuan, Yu, Zheng, Guo, Longteng, Chen, Sihan, Zhao, Zijia, Sun, Mingzhen, Wu, Qi, Liu, Jing
Multimodal large language models (MLLMs) have attracted widespread interest and have rich applications. However, the inherent attention mechanism in their Transformer structure has quadratic complexity and results in expensive computational overhead…
External link:
http://arxiv.org/abs/2403.13600
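The quadratic cost the abstract refers to comes from the attention score matrix, which has one entry per pair of sequence positions. The sketch below is a generic single-head attention in PyTorch for illustration only, not the paper's model; doubling the sequence length quadruples the score matrix.

```python
import torch

def attention(q, k, v):
    # scores has shape (seq_len, seq_len), so compute and memory
    # grow quadratically with sequence length.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

for n in (512, 1024, 2048):
    q = k = v = torch.randn(n, 64)
    out = attention(q, k, v)
    print(f"seq_len={n}: score matrix holds {n * n:,} entries")
```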
Author:
Yue, Tongtian, Cheng, Jie, Guo, Longteng, Dai, Xingyuan, Zhao, Zijia, He, Xingjian, Xiong, Gang, Lv, Yisheng, Liu, Jing
Recent trends in Large Vision Language Models (LVLMs) research have been increasingly focusing on advancing beyond general image understanding towards more nuanced, object-level referential comprehension. In this paper, we present and delve into the…
External link:
http://arxiv.org/abs/2403.13263
Author:
Hao, Dongze, Jia, Jian, Guo, Longteng, Wang, Qunbo, Yang, Te, Li, Yan, Cheng, Yanhua, Wang, Bo, Chen, Quan, Li, Han, Liu, Jing
Knowledge-based visual question answering (KB-VQA) is a challenging task, which requires the model to leverage external knowledge for comprehending and answering questions grounded in visual content. Recent studies retrieve the knowledge passages from…
External link:
http://arxiv.org/abs/2403.10037
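As a rough orientation for the retrieve-then-answer pattern the abstract describes, here is a toy Python skeleton. The passage store, the term-overlap scorer, and the downstream answerer are stand-ins invented for this sketch; they are not the paper's actual pipeline.

```python
import re

def retrieve(question, passages, top_n=1):
    """Toy retriever: rank knowledge passages by question-term overlap."""
    terms = re.findall(r"\w+", question.lower())
    return sorted(passages,
                  key=lambda p: -sum(t in p.lower() for t in terms))[:top_n]

passages = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Golden retrievers are a dog breed that originated in Scotland.",
]
question = "When was the tower in the photo completed?"
context = retrieve(question, passages)
# The retrieved passages would be fed to the VQA model alongside the image.
print(context[0])
```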