Showing 1 - 10 of 82 for search: '"SIAROHIN, ALIAKSANDR"'
Author:
Haji-Ali, Moayed, Menapace, Willi, Siarohin, Aliaksandr, Skorokhodov, Ivan, Canberk, Alper, Lee, Kwot Sin, Ordonez, Vicente, Tulyakov, Sergey
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block…
External link:
http://arxiv.org/abs/2412.15191
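As a rough illustration of the conditioning idea in the snippet above, the sketch below cross-attends frozen activations of one modality into the tokens of the other. The class name FusionBlock, the shapes, and the residual design are assumptions made for illustration, not the paper's released code.

```python
# Hypothetical sketch of temporally-aligned cross-modal fusion: activations
# from a frozen diffusion model of one modality condition the other via
# cross-attention. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_feats: torch.Tensor, source_feats: torch.Tensor) -> torch.Tensor:
        # target_feats: (B, T_tgt, dim) tokens of the modality being generated
        # source_feats: (B, T_src, dim) frozen activations of the conditioning modality
        attended, _ = self.attn(self.norm(target_feats), source_feats, source_feats)
        return target_feats + attended  # residual conditioning

# e.g. condition audio tokens on frozen video-diffusion activations
block = FusionBlock()
audio = torch.randn(2, 128, 512)   # audio tokens
video = torch.randn(2, 64, 512)    # frozen video activations
out = block(audio, video)          # (2, 128, 512)
```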
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention…
External link:
http://arxiv.org/abs/2412.07776
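A hedged sketch of the attention-analysis step mentioned above: reading cross-frame attention between the patch tokens of two frames and turning it into a coarse patch-level motion field. The argmax displacement heuristic and all shapes are assumptions, not DiTFlow's exact procedure.

```python
# Illustrative sketch: derive a coarse motion descriptor from cross-frame
# attention, assuming queries of frame t attend to keys of frame t+1.
import torch

def cross_frame_motion(q_t, k_next, grid_h, grid_w):
    # q_t, k_next: (N, dim) patch tokens of consecutive frames, N = grid_h * grid_w
    attn = torch.softmax(q_t @ k_next.T / q_t.shape[-1] ** 0.5, dim=-1)  # (N, N)
    # each patch "moves" to its most-attended patch in the next frame
    src = torch.arange(q_t.shape[0])
    dst = attn.argmax(dim=-1)
    dy = dst // grid_w - src // grid_w
    dx = dst % grid_w - src % grid_w
    return torch.stack([dx, dy], dim=-1).view(grid_h, grid_w, 2)  # patch-level flow

flow = cross_frame_motion(torch.randn(16 * 16, 64), torch.randn(16 * 16, 64), 16, 16)
```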
Author:
Wu, Ziyi, Siarohin, Aliaksandr, Menapace, Willi, Skorokhodov, Ivan, Fang, Yuwei, Chordia, Varnith, Gilitschenski, Igor, Tulyakov, Sergey
Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described…
External link:
http://arxiv.org/abs/2412.05263
Author:
Wang, Chaoyang, Zhuang, Peiye, Ngo, Tuan Duc, Menapace, Willi, Siarohin, Aliaksandr, Vasilkovsky, Michael, Skorokhodov, Ivan, Tulyakov, Sergey, Wonka, Peter, Lee, Hsin-Ying
We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes. In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint…
External link:
http://arxiv.org/abs/2412.04462
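The time-by-viewpoint grid described above can be pictured as a 5D tensor in which rows index time and columns index viewpoint; a minimal sketch, with all shapes arbitrary:

```python
# Minimal illustration of the 4D video grid: rows share a timestep,
# columns share a viewpoint. Shapes are arbitrary placeholders.
import torch

T, V, C, H, W = 8, 4, 3, 64, 64   # timesteps, viewpoints, channels, height, width
grid = torch.randn(T, V, C, H, W)

row_t = grid[2]      # all viewpoints at timestep 2 -> (V, C, H, W)
col_v = grid[:, 1]   # one viewpoint across time    -> (T, C, H, W)
```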
Author:
Bahmani, Sherwin, Skorokhodov, Ivan, Qian, Guocheng, Siarohin, Aliaksandr, Menapace, Willi, Tagliasacchi, Andrea, Lindell, David B., Tulyakov, Sergey
Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first principles perspective…
External link:
http://arxiv.org/abs/2411.18673
Author:
Kag, Anil, Coskun, Huseyin, Chen, Jierun, Cao, Junli, Menapace, Willi, Siarohin, Aliaksandr, Tulyakov, Sergey, Ren, Jian
Neural network architecture design requires making many crucial decisions. A common desideratum is that similar decisions, with minor modifications, can be reused in a variety of tasks and applications. To satisfy this, architectures must provide…
External link:
http://arxiv.org/abs/2411.04967
Author:
Tang, Zhenggang, Zhuang, Peiye, Wang, Chaoyang, Siarohin, Aliaksandr, Kant, Yash, Schwing, Alexander, Tulyakov, Sergey, Lee, Hsin-Ying
The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models to a multi-view version, which contains a VAE image encoder…
External link:
http://arxiv.org/abs/2408.14016
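A loose sketch of the setup the snippet describes, assuming the VAE-encoded input latent conditions joint denoising of several view latents; the stand-in denoiser below is illustrative only, not the paper's model.

```python
# Hedged sketch of image-to-multi-view latent diffusion: a VAE latent of the
# input image conditions denoising of N view latents. Everything here is a
# simplified stand-in for illustration.
import torch
import torch.nn as nn

class MultiViewDenoiser(nn.Module):
    def __init__(self, c: int = 4):
        super().__init__()
        # conditioning latent is concatenated channel-wise to each noisy view
        self.net = nn.Conv2d(2 * c, c, kernel_size=3, padding=1)

    def forward(self, noisy_views, cond_latent):
        # noisy_views: (B, V, C, h, w); cond_latent: (B, C, h, w) from the VAE encoder
        cond = cond_latent.unsqueeze(1).expand_as(noisy_views)
        x = torch.cat([noisy_views, cond], dim=2)  # (B, V, 2C, h, w)
        b, v = x.shape[:2]
        out = self.net(x.flatten(0, 1))            # fold views into the batch dim
        return out.view(b, v, *out.shape[1:])      # predicted noise per view

denoiser = MultiViewDenoiser()
eps = denoiser(torch.randn(1, 4, 4, 32, 32), torch.randn(1, 4, 32, 32))
```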
Author:
Bahmani, Sherwin, Skorokhodov, Ivan, Siarohin, Aliaksandr, Menapace, Willi, Qian, Guocheng, Vasilkovsky, Michael, Lee, Hsin-Ying, Wang, Chaoyang, Zou, Jiaxu, Tagliasacchi, Andrea, Lindell, David B., Tulyakov, Sergey
Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications…
External link:
http://arxiv.org/abs/2407.12781
Author:
Fang, Yuwei, Menapace, Willi, Siarohin, Aliaksandr, Chen, Tsai-Shien, Wang, Kuan-Chien, Skorokhodov, Ivan, Neubig, Graham, Tulyakov, Sergey
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility…
External link:
http://arxiv.org/abs/2407.06304
Author:
Haji-Ali, Moayed, Menapace, Willi, Siarohin, Aliaksandr, Balakrishnan, Guha, Tulyakov, Sergey, Ordonez, Vicente
Generating ambient sounds is a challenging task due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle this problem by introducing two new models.
External link:
http://arxiv.org/abs/2406.19388