Search results - "Zou, YueXian"

Report

BrushEdit: All-In-One Image Inpainting and Editing

Authors: Li, Yaowei; Bian, Yuxuan; Ju, Xuan; Zhang, Zhaoyang; Shan, Ying; Zou, Yuexian; Xu, Qiang

Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects)

External link: http://arxiv.org/abs/2412.10316

Report

CAR: Controllable Autoregressive Modeling for Visual Generation

Authors: Yao, Ziyu; Li, Jialin; Zhou, Yifeng; Liu, Yong; Jiang, Xi; Wang, Chengjie; Zheng, Feng; Zou, Yuexian; Li, Lei

Controllable generation, which enables fine-grained control over generated outputs, has emerged as a critical focus in visual generative models. Currently, there are two primary technical approaches in visual generation: diffusion models and autoregr

External link: http://arxiv.org/abs/2410.04671

Report

DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Authors: Xin, Yifei; Cheng, Xuxin; Zhu, Zhihong; Yang, Xusheng; Zou, Yuexian

Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(

External link: http://arxiv.org/abs/2409.10025

Report

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Authors: Xin, Yifei; Zhu, Zhihong; Cheng, Xuxin; Yang, Xusheng; Zou, Yuexian

Most existing audio-text retrieval (ATR) approaches typically rely on a single-level interaction to associate audio and text, limiting their ability to align different modalities and leading to suboptimal matches. In this work, we present a novel ATR

External link: http://arxiv.org/abs/2409.09256

Report

Image Conductor: Precision Control for Interactive Video Synthesis

Authors: Li, Yaowei; Wang, Xintao; Zhang, Zhaoyang; Wang, Zhouxia; Yuan, Ziyang; Xie, Liangbin; Zou, Yuexian; Shan, Ying

Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, a

External link: http://arxiv.org/abs/2406.15339

Report

On the Worst Prompt Performance of Large Language Models

Authors: Cao, Bowen; Cai, Deng; Zhang, Zhisong; Zou, Yuexian; Lam, Wai

The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and

External link: http://arxiv.org/abs/2406.10248

Report

Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning

Authors: Cheng, Xuxin; Xu, Wanshi; Zhu, Zhihong; Li, Hongxiang; Zou, Yuexian

Spoken language understanding (SLU) is a core task in task-oriented dialogue systems, which aims at understanding the user's current goal through constructing semantic frames. SLU usually consists of two subtasks, including intent detection and slot

External link: http://arxiv.org/abs/2405.20852

Report

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Authors: Kelly, Chris; Hu, Luhui; Hu, Jiayin; Tian, Yu; Yang, Deshun; Yang, Bang; Yang, Cindy; Li, Zihao; Huang, Zaoshan; Zou, Yuexian

The evolution of text-to-visual components facilitates people's daily lives, such as generating images and videos from text and identifying the desired elements within the images. Computer vision models involving the multimodal abilities in the previous

External link: http://arxiv.org/abs/2403.09530

Report

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Authors: Kelly, Chris; Hu, Luhui; Yang, Bang; Tian, Yu; Yang, Deshun; Yang, Cindy; Huang, Zaoshan; Li, Zihao; Hu, Jiayin; Zou, Yuexian

With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this pape

External link: http://arxiv.org/abs/2403.09027

Report

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Authors: Yang, Deshun; Hu, Luhui; Tian, Yu; Li, Zihao; Kelly, Chris; Yang, Bang; Yang, Cindy; Zou, Yuexian

Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, it remains a formidable challenge pertaining to maintaining temporal consistency and ensuring action smoothness thr

External link: http://arxiv.org/abs/2403.07944
