Showing 1 - 10 of 206 for search: '"Jin, Hailin"'
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a f
External link:
http://arxiv.org/abs/2406.17880
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Due to the lack of a diverse and generalisable VMR dataset to facilitate learning scalable moment-tex
External link:
http://arxiv.org/abs/2401.13329
Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and
External link:
http://arxiv.org/abs/2309.03696
Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely either biased when derived from a limited amount of moment-text da
External link:
http://arxiv.org/abs/2309.00661
The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods heavily rely on separate pre-training feature extractors for visual and textual understanding. Without sufficient temporal boundary a
External link:
http://arxiv.org/abs/2303.00040
Livestream videos have become a significant part of online learning, where design, digital marketing, creative painting, and other skills are taught by experienced experts in the sessions, making them valuable materials. However, Livestream tutorial
External link:
http://arxiv.org/abs/2210.05840
Author:
Qiu, Jielin, Zhu, Jiacheng, Xu, Mengdi, Dernoncourt, Franck, Bui, Trung, Wang, Zhaowen, Li, Bo, Zhao, Ding, Jin, Hailin
Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications, i.e., automatically generating cover images and titles for news articles or provid
External link:
http://arxiv.org/abs/2210.04722
Current methods for video activity localisation over time assume implicitly that activity temporal boundaries labelled for model training are determined and precise. However, in unscripted natural videos, different activities mostly transit smoothly,
External link:
http://arxiv.org/abs/2206.12923
Author:
Qiu, Jielin, Zhu, Jiacheng, Xu, Mengdi, Dernoncourt, Franck, Bui, Trung, Wang, Zhaowen, Li, Bo, Zhao, Ding, Jin, Hailin
Multimedia summarization with multimodal output can play an essential role in real-world applications, i.e., automatically generating cover images and titles for news articles or providing introductions to online videos. In this work, we propose a mu
External link:
http://arxiv.org/abs/2204.03734
Author:
Ruta, Dan, Gilbert, Andrew, Aggarwal, Pranav, Marri, Naveen, Kale, Ajinkya, Briggs, Jo, Speed, Chris, Jin, Hailin, Faieta, Baldo, Filipkowski, Alex, Lin, Zhe, Collomosse, John
We present StyleBabel, a unique open access dataset of natural language captions and free-form tags describing the artistic style of over 135K digital artworks, collected via a novel participatory method from experts studying at specialist art and de
External link:
http://arxiv.org/abs/2203.05321