Showing 1 - 10 of 6,070 for the search: '"Video model"'
Remote sensing image change captioning (RSICC) aims to provide natural language descriptions for bi-temporal remote sensing images. Since the Change Captioning (CC) task requires both spatial and temporal features, previous works follow an encoder-fusion-decoder …
External link:
http://arxiv.org/abs/2410.23946
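The encoder-fusion-decoder paradigm this abstract refers to can be illustrated with a minimal PyTorch sketch. Everything below (the `BiTemporalCaptioner` class, the layer sizes, the concatenation-based fusion) is a hypothetical stand-in for the paradigm, not the architecture of the paper itself.

```python
import torch
import torch.nn as nn

class BiTemporalCaptioner(nn.Module):
    """Minimal encoder-fusion-decoder sketch for change captioning.

    A shared visual encoder processes each temporal image; the two
    feature maps are fused (here: concatenation + 1x1 conv) and a
    Transformer decoder generates caption tokens. Illustrative only.
    """

    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        self.encoder = nn.Sequential(              # shared for both images
            nn.Conv2d(3, d_model, kernel_size=8, stride=8),
            nn.ReLU(),
        )
        self.fuse = nn.Conv2d(2 * d_model, d_model, kernel_size=1)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img_t0, img_t1, caption_tokens):
        f0, f1 = self.encoder(img_t0), self.encoder(img_t1)
        fused = self.fuse(torch.cat([f0, f1], dim=1))    # (B, D, H, W)
        memory = fused.flatten(2).transpose(1, 2)        # (B, H*W, D)
        tgt = self.embed(caption_tokens)                 # (B, T, D)
        return self.out(self.decoder(tgt, memory))      # (B, T, vocab)

model = BiTemporalCaptioner()
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])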
Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which …
External link:
http://arxiv.org/abs/2412.04814
Author:
Parikh, Vatsal Vinay
The recent introduction of OpenAI's text-to-video model Sora has sparked widespread public discourse across online communities. This study aims to uncover the dominant themes and narratives surrounding Sora by conducting topic modeling analysis on a …
External link:
http://arxiv.org/abs/2407.13071
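As a rough illustration of what such a topic-modeling analysis can look like, here is a minimal scikit-learn LDA sketch on placeholder posts. The study's actual corpus, preprocessing, and choice of model are not visible in this snippet and may well differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder posts standing in for scraped community discussions.
posts = [
    "Sora generates strikingly realistic video from text prompts",
    "concerns about misinformation and deepfakes from video models",
    "how will text-to-video tools change filmmaking workflows",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words per discovered topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")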
Author:
Zhao, Yue; Krähenbühl, Philipp
Videos are big, complex to pre-process, and slow to train on. State-of-the-art large-scale video models are trained on clusters of 32 or more GPUs for several days. As a consequence, academia largely ceded the training of large video models to industry …
External link:
http://arxiv.org/abs/2309.16669
Author:
Chen, Gang
We present a general and simple text-to-video model based on the Transformer. Since both text and video are sequential data, we encode both the text and the frames into the same hidden space; these encodings are then fed into a Transformer to capture temporal consistency …
External link:
http://arxiv.org/abs/2309.14683
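A minimal sketch of the idea described here: text tokens and frame features are embedded into one shared hidden space, and a single Transformer models the joint sequence. The class name, dimensions, and next-frame-feature head are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TextToVideoTransformer(nn.Module):
    """Sketch: text tokens and frame features share one hidden space,
    and a single Transformer models the joint sequence. Illustrative
    only; the paper's actual layer sizes and heads are not shown.
    """

    def __init__(self, vocab_size=10000, frame_dim=512, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text -> hidden
        self.frame_embed = nn.Linear(frame_dim, d_model)      # frames -> hidden
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.to_frame = nn.Linear(d_model, frame_dim)         # frame prediction

    def forward(self, text_tokens, frame_feats):
        # Concatenate both modalities into one sequence in the shared space.
        seq = torch.cat([self.text_embed(text_tokens),
                         self.frame_embed(frame_feats)], dim=1)
        hidden = self.transformer(seq)
        # Read out the frame positions to predict frame features.
        n_frames = frame_feats.size(1)
        return self.to_frame(hidden[:, -n_frames:])

model = TextToVideoTransformer()
pred = model(torch.randint(0, 10000, (2, 16)), torch.randn(2, 8, 512))
print(pred.shape)  # torch.Size([2, 8, 512])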
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, …
External link:
http://arxiv.org/abs/2309.16429
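The two alignment requirements the abstract spells out, global (clip-level semantics) and temporal (per-segment synchrony), can be expressed as simple losses over paired embeddings. The sketch below is a generic formulation under assumed per-segment video and audio encoders, not the paper's actual training objective.

```python
import torch
import torch.nn.functional as F

def alignment_losses(video_emb, audio_emb):
    """video_emb, audio_emb: (B, T, D) per-segment embeddings from
    hypothetical video/audio encoders. Global alignment: time-averaged
    clip embeddings should match. Temporal alignment: each segment
    should match its own time-step, scored by cross-entropy.
    """
    g_v = F.normalize(video_emb.mean(dim=1), dim=-1)     # (B, D)
    g_a = F.normalize(audio_emb.mean(dim=1), dim=-1)
    global_loss = 1 - (g_v * g_a).sum(dim=-1).mean()

    v = F.normalize(video_emb, dim=-1)                   # (B, T, D)
    a = F.normalize(audio_emb, dim=-1)
    sim = torch.einsum("btd,bsd->bts", v, a)             # (B, T, T)
    target = torch.arange(sim.size(1)).expand(sim.size(0), -1)
    temporal_loss = F.cross_entropy(
        sim.reshape(-1, sim.size(-1)), target.reshape(-1))
    return global_loss, temporal_loss

gl, tl = alignment_losses(torch.randn(2, 10, 128), torch.randn(2, 10, 128))
print(gl.item(), tl.item())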
Evaluating the quality of videos generated from text-to-video (T2V) models is important if they are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations …
External link:
http://arxiv.org/abs/2309.08009
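One common family of metrics in this space averages CLIP text-image similarity over sampled frames; a sketch using the Hugging Face `transformers` CLIP wrapper is below. This is a generic example, not necessarily one of the metrics the paper examines, and it illustrates an often-cited limitation: a per-frame score ignores temporal coherence entirely.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_averaged_clip_score(prompt, frames):
    """prompt: text string; frames: list of PIL.Image frames sampled
    from the generated video. Returns the mean text-frame similarity.
    """
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (n_frames, 1) text-image similarity logits.
    return out.logits_per_image.mean().item()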
Academic article
This result cannot be displayed to unauthenticated users; signing in is required to view it.
Video understanding tasks have traditionally been modeled by two separate architectures, each tailored to a distinct task. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features …
External link:
http://arxiv.org/abs/2303.17228
Self-supervised ultrasound (US) video model pretraining can use a small amount of labeled data to achieve some of the most promising results in US diagnosis. However, it does not take full advantage of multi-level knowledge for learning deep neural networks …
External link:
http://arxiv.org/abs/2210.04477