Search results

Report

MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval

Authors: Cai, Weitong; Huang, Jiabo; Gong, Shaogang; Jin, Hailin; Liu, Yang

Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a f

External link: http://arxiv.org/abs/2406.17880

Report

Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval

Authors: Luo, Dezhao; Gong, Shaogang; Huang, Jiabo; Jin, Hailin; Liu, Yang

Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Due to the lack of a diverse and generalisable VMR dataset to facilitate learning scalable moment-tex

External link: http://arxiv.org/abs/2401.13329

Report

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory

Authors: Lei, Ting; Caba, Fabian; Chen, Qingchao; Jin, Hailin; Peng, Yuxin; Liu, Yang

Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and

External link: http://arxiv.org/abs/2309.03696

Report

Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models

Authors: Luo, Dezhao; Huang, Jiabo; Gong, Shaogang; Jin, Hailin; Liu, Yang

Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely either biased when derived from a limited amount of moment-text da

External link: http://arxiv.org/abs/2309.00661

Report

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

Authors: Luo, Dezhao; Huang, Jiabo; Gong, Shaogang; Jin, Hailin; Liu, Yang

The correlation between the vision and text is essential for video moment retrieval (VMR), however, existing methods heavily rely on separate pre-training feature extractors for visual and textual understanding. Without sufficient temporal boundary a

External link: http://arxiv.org/abs/2303.00040

Report

LiveSeg: Unsupervised Multimodal Temporal Segmentation of Long Livestream Videos

Authors: Qiu, Jielin; Dernoncourt, Franck; Bui, Trung; Wang, Zhaowen; Zhao, Ding; Jin, Hailin

Livestream videos have become a significant part of online learning, where design, digital marketing, creative painting, and other skills are taught by experienced experts in the sessions, making them valuable materials. However, Livestream tutorial

External link: http://arxiv.org/abs/2210.05840

Report

Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment

Authors: Qiu, Jielin; Zhu, Jiacheng; Xu, Mengdi; Dernoncourt, Franck; Bui, Trung; Wang, Zhaowen; Li, Bo; Zhao, Ding; Jin, Hailin

Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications, i.e., automatically generating cover images and titles for news articles or provid

External link: http://arxiv.org/abs/2210.04722

Report

Video Activity Localisation with Uncertainties in Temporal Boundary

Authors: Huang, Jiabo; Jin, Hailin; Gong, Shaogang; Liu, Yang

Current methods for video activity localisation over time assume implicitly that activity temporal boundaries labelled for model training are determined and precise. However, in unscripted natural videos, different activities mostly transit smoothly,

External link: http://arxiv.org/abs/2206.12923

Report

MHMS: Multimodal Hierarchical Multimedia Summarization

Authors: Qiu, Jielin; Zhu, Jiacheng; Xu, Mengdi; Dernoncourt, Franck; Bui, Trung; Wang, Zhaowen; Li, Bo; Zhao, Ding; Jin, Hailin

Multimedia summarization with multimodal output can play an essential role in real-world applications, i.e., automatically generating cover images and titles for news articles or providing introductions to online videos. In this work, we propose a mu

External link: http://arxiv.org/abs/2204.03734

Report

StyleBabel: Artistic Style Tagging and Captioning

Authors: Ruta, Dan; Gilbert, Andrew; Aggarwal, Pranav; Marri, Naveen; Kale, Ajinkya; Briggs, Jo; Speed, Chris; Jin, Hailin; Faieta, Baldo; Filipkowski, Alex; Lin, Zhe; Collomosse, John

We present StyleBabel, a unique open access dataset of natural language captions and free-form tags describing the artistic style of over 135K digital artworks, collected via a novel participatory method from experts studying at specialist art and de

External link: http://arxiv.org/abs/2203.05321
