Showing 1 - 10 of 127 for search: '"Wen, Longyin"'
Author:
Xu, Lu, Zhu, Sijie, Li, Chunyuan, Kuo, Chia-Wen, Chen, Fan, Wang, Xinyao, Chen, Guang, Du, Dawei, Yuan, Ye, Wen, Longyin
The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos …
External link:
http://arxiv.org/abs/2406.10484
Author:
Li, Jiachen, Wang, Xinyao, Zhu, Sijie, Kuo, Chia-Wen, Xu, Lu, Chen, Fan, Jain, Jitesh, Shi, Humphrey, Wen, Longyin
Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally …
External link:
http://arxiv.org/abs/2405.05949
This paper focuses on understanding the predominant video creation pipeline, i.e., compositional video editing with six main types of editing components, including video effects, animation, transition, filter, sticker, and text. In contrast to existing …
External link:
http://arxiv.org/abs/2403.16048
Existing video captioning approaches typically require first sampling video frames from a decoded video and then conducting a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may …
External link:
http://arxiv.org/abs/2309.12867
Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches have achieved …
External link:
http://arxiv.org/abs/2306.12559
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve performance in real-world applications, mainly due to the challenge of long-tail words.
External link:
http://arxiv.org/abs/2303.12423
Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior arts …
External link:
http://arxiv.org/abs/2303.03032
This paper describes our champion solution for the CVPR2022 Generic Event Boundary Captioning (GEBC) competition. GEBC requires the captioning model to have a comprehension of instantaneous status changes around the given video boundary, which makes …
External link:
http://arxiv.org/abs/2207.03038
This report presents the algorithm used in our submission to the Generic Event Boundary Detection (GEBD) Challenge at CVPR 2022. In this work, we improve the existing Structured Context Transformer (SC-Transformer) method for GEBD. Specifically, …
External link:
http://arxiv.org/abs/2206.12634
Author:
Li, Congcong, Wang, Xinyao, Hong, Dexiang, Wang, Yufei, Zhang, Libo, Luo, Tiejian, Wen, Longyin
Generic Event Boundary Detection (GEBD) aims to detect moments that humans naturally perceive as event boundaries. In this paper, we present the Structured Context Transformer (SC-Transformer) to solve the GEBD task, which can be trained in an end-to-end …
External link:
http://arxiv.org/abs/2206.02985