One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Authors: | Bai, Zechen; He, Tong; Mei, Haiyang; Wang, Pichao; Gao, Ziteng; Chen, Joya; Liu, Lei; Zhang, Zheng; Shou, Mike Zheng |
---|---|
Publication year: | 2024 |
Subject: | |
Document type: | Working Paper |
Description: | We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed … Comment: Accepted by NeurIPS 2024 |
Database: | arXiv |
External link: |