Showing 1 - 10 of 347 for search: '"Li Yizhuo"'
Videos are inherently temporal sequences. In this work, we explore the potential of modeling videos in a chronological and scalable manner with autoregressive (AR) language models, inspired by their success in natural language processing…
External link:
http://arxiv.org/abs/2412.04446
Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained…
External link:
http://arxiv.org/abs/2412.04445
In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge…
External link:
http://arxiv.org/abs/2412.04432
Author:
Li Yizhuo
Published in:
Parse Journal, Vol. On the Question of Exhibition Part 3, Iss. 13.3 (2021)
Analyzing the exhibition “Archiving the Spaces of Anxiety” as an exemplary response to the OCAT Institute’s call for research-based curatorial projects, this essay compares the curator Chen Shuyu’s proposition of “curatorial spatiality”…
External link:
https://doaj.org/article/72d18ebbe1bc4254b8662c6504da9ab9
Author:
Li, Kunchang, Wang, Yali, He, Yinan, Li, Yizhuo, Wang, Yi, Liu, Yi, Wang, Zun, Xu, Jilan, Chen, Guo, Luo, Ping, Wang, Limin, Qiao, Yu
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding…
External link:
http://arxiv.org/abs/2311.17005
Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets. In this paper, we propose an efficient framework to harvest video foundation models from…
External link:
http://arxiv.org/abs/2310.19554
Author:
Wang, Yi, He, Yinan, Li, Yizhuo, Li, Kunchang, Yu, Jiashuo, Ma, Xin, Li, Xinhao, Chen, Guo, Chen, Xinyuan, Wang, Yaohui, He, Conghui, Luo, Ping, Liu, Ziwei, Wang, Yali, Wang, Limin, Qiao, Yu
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos…
External link:
http://arxiv.org/abs/2307.06942
Author:
Li, KunChang, He, Yinan, Wang, Yi, Li, Yizhuo, Wang, Wenhai, Luo, Ping, Wang, Yali, Wang, Limin, Qiao, Yu
In this paper, we initiate an attempt to develop an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal…
External link:
http://arxiv.org/abs/2305.06355
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has…
External link:
http://arxiv.org/abs/2303.16058
Author:
Wang, Yi, Li, Kunchang, Li, Yizhuo, He, Yinan, Huang, Bingkun, Zhao, Zhiyu, Zhang, Hongjie, Xu, Jilan, Liu, Yi, Wang, Zun, Xing, Sen, Chen, Guo, Pan, Junting, Yu, Jiashuo, Wang, Yali, Wang, Limin, Qiao, Yu
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic…
External link:
http://arxiv.org/abs/2212.03191