Showing 1 - 10 of 127 for search: '"Wen, Longyin"'
Author:
Xu, Lu, Zhu, Sijie, Li, Chunyuan, Kuo, Chia-Wen, Chen, Fan, Wang, Xinyao, Chen, Guang, Du, Dawei, Yuan, Ye, Wen, Longyin
The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos …
External link:
http://arxiv.org/abs/2406.10484
Author:
Li, Jiachen, Wang, Xinyao, Zhu, Sijie, Kuo, Chia-Wen, Xu, Lu, Chen, Fan, Jain, Jitesh, Shi, Humphrey, Wen, Longyin
Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally …
External link:
http://arxiv.org/abs/2405.05949
This paper focuses on understanding the predominant video creation pipeline, i.e., compositional video editing with six main types of editing components, including video effects, animation, transition, filter, sticker, and text. In contrast to existing …
External link:
http://arxiv.org/abs/2403.16048
Existing video captioning approaches typically require first sampling video frames from a decoded video and then conducting a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may …
External link:
http://arxiv.org/abs/2309.12867
Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches have achieved …
External link:
http://arxiv.org/abs/2306.12559
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve performance in real-world applications, mainly due to the challenge of long-tail words.
External link:
http://arxiv.org/abs/2303.12423
Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior arts …
External link:
http://arxiv.org/abs/2303.03032
This paper describes our champion solution for the CVPR2022 Generic Event Boundary Captioning (GEBC) competition. GEBC requires the captioning model to have a comprehension of instantaneous status changes around the given video boundary, which makes …
External link:
http://arxiv.org/abs/2207.03038
This report presents the algorithm used in our submission to the Generic Event Boundary Detection (GEBD) Challenge at CVPR 2022. In this work, we improve the existing Structured Context Transformer (SC-Transformer) method for GEBD. Specifically, …
External link:
http://arxiv.org/abs/2206.12634
Author:
Li, Congcong, Wang, Xinyao, Hong, Dexiang, Wang, Yufei, Zhang, Libo, Luo, Tiejian, Wen, Longyin
Generic Event Boundary Detection (GEBD) aims to detect moments that humans naturally perceive as event boundaries. In this paper, we present the Structured Context Transformer (SC-Transformer) to solve the GEBD task, which can be trained in an end-to-end …
External link:
http://arxiv.org/abs/2206.02985