Showing 1 - 10 of 283 for search: '"Zou, YueXian"'
Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects)
External link:
http://arxiv.org/abs/2412.10316
Authors:
Yao, Ziyu, Li, Jialin, Zhou, Yifeng, Liu, Yong, Jiang, Xi, Wang, Chengjie, Zheng, Feng, Zou, Yuexian, Li, Lei
Controllable generation, which enables fine-grained control over generated outputs, has emerged as a critical focus in visual generative models. Currently, there are two primary technical approaches in visual generation: diffusion models and autoregr
External link:
http://arxiv.org/abs/2410.04671
Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(
External link:
http://arxiv.org/abs/2409.10025
Most existing audio-text retrieval (ATR) approaches typically rely on a single-level interaction to associate audio and text, limiting their ability to align different modalities and leading to suboptimal matches. In this work, we present a novel ATR
External link:
http://arxiv.org/abs/2409.09256
Authors:
Li, Yaowei, Wang, Xintao, Zhang, Zhaoyang, Wang, Zhouxia, Yuan, Ziyang, Xie, Liangbin, Zou, Yuexian, Shan, Ying
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, a
External link:
http://arxiv.org/abs/2406.15339
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and
External link:
http://arxiv.org/abs/2406.10248
Spoken language understanding (SLU) is a core task in task-oriented dialogue systems, which aims at understanding the user's current goal through constructing semantic frames. SLU usually consists of two subtasks, including intent detection and slot
External link:
http://arxiv.org/abs/2405.20852
Authors:
Kelly, Chris, Hu, Luhui, Hu, Jiayin, Tian, Yu, Yang, Deshun, Yang, Bang, Yang, Cindy, Li, Zihao, Huang, Zaoshan, Zou, Yuexian
The evolution of text to visual components facilitates people's daily lives, such as generating images and videos from text and identifying the desired elements within the images. Computer vision models involving the multimodal abilities in the previous
External link:
http://arxiv.org/abs/2403.09530
Authors:
Kelly, Chris, Hu, Luhui, Yang, Bang, Tian, Yu, Yang, Deshun, Yang, Cindy, Huang, Zaoshan, Li, Zihao, Hu, Jiayin, Zou, Yuexian
With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this pape
External link:
http://arxiv.org/abs/2403.09027
Authors:
Yang, Deshun, Hu, Luhui, Tian, Yu, Li, Zihao, Kelly, Chris, Yang, Bang, Yang, Cindy, Zou, Yuexian
Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, it remains a formidable challenge pertaining to maintaining temporal consistency and ensuring action smoothness thr
External link:
http://arxiv.org/abs/2403.07944