Search results

Report

Training-free Regional Prompting for Diffusion Transformers

Authors: Chen, Anthony; Xu, Jianjin; Zheng, Wenzhao; Dai, Gaole; Wang, Yida; Zhang, Renrui; Wang, Haofan; Zhang, Shanghang

Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models

External link: http://arxiv.org/abs/2411.02395

Report

HeightMapNet: Explicit Height Modeling for End-to-End HD Map Learning

Authors: Qiu, Wenzhao; Pang, Shanmin; Zhang, Hao; Fang, Jianwu; Xue, Jianru

Recent advances in high-definition (HD) map construction from surround-view images have highlighted their cost-effectiveness in deployment. However, prevailing techniques often fall short in accurately extracting and utilizing road features, as well

External link: http://arxiv.org/abs/2411.01408

Report

PixelGaussian: Generalizable 3D Gaussian Reconstruction from Arbitrary Views

Authors: Fei, Xin; Zheng, Wenzhao; Duan, Yueqi; Zhan, Wei; Tomizuka, Masayoshi; Keutzer, Kurt; Lu, Jiwen

We propose PixelGaussian, an efficient feed-forward framework for learning generalizable 3D Gaussian reconstruction from arbitrary views. Most existing methods rely on uniform pixel-wise Gaussian representations, which learn a fixed number of 3D Gaus

External link: http://arxiv.org/abs/2410.18979

Report

UniDrive: Towards Universal Driving Perception Across Camera Configurations

Authors: Li, Ye; Zheng, Wenzhao; Huang, Xiaonan; Keutzer, Kurt

Vision-centric autonomous driving has demonstrated excellent performance with economical sensors. As the fundamental step, 3D perception aims to infer 3D information from 2D images based on 3D-2D projection. This makes driving perception models susce

External link: http://arxiv.org/abs/2410.13864

Report

V2M: Visual 2-Dimensional Mamba for Image Representation Learning

Authors: Wang, Chengkun; Zheng, Wenzhao; Huang, Yuanhui; Zhou, Jie; Lu, Jiwen

Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D ima

External link: http://arxiv.org/abs/2410.10382

Report

GlobalMamba: Global Image Serialization for Vision Mamba

Authors: Wang, Chengkun; Zheng, Wenzhao; Zhou, Jie; Lu, Jiwen

Vision mambas have demonstrated strong performance with linear complexity to the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and the

External link: http://arxiv.org/abs/2410.10316

Report

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Authors: Zhang, Yuan; Fan, Chun-Kai; Ma, Junpeng; Zheng, Wenzhao; Huang, Tao; Cheng, Kuan; Gudovskiy, Denis; Okuno, Tomoyuki; Nakata, Yohei; Keutzer, Kurt; Zhang, Shanghang

In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redunda

External link: http://arxiv.org/abs/2410.04417

Report

FoAM: Foresight-Augmented Multi-Task Imitation Policy for Robotic Manipulation

Authors: Liu, Litao; Wang, Wentao; Han, Yifan; Xie, Zhuoli; Yi, Pengfei; Li, Junyan; Qin, Yi; Lian, Wenzhao

Multi-task imitation learning (MTIL) has shown significant potential in robotic manipulation by enabling agents to perform various tasks using a unified policy. This simplifies the policy deployment and enhances the agent's adaptability across differ

External link: http://arxiv.org/abs/2409.19528

Report

Technical Report: Competition Solution For Modelscope-Sora

Authors: Chen, Shengfu; Liu, Hailong; Wei, Wenzhao

This report presents the approach adopted in the Modelscope-Sora challenge, which focuses on fine-tuning data for video generation models. The challenge evaluates participants' ability to analyze, clean, and generate high-quality datasets for video-b

External link: http://arxiv.org/abs/2410.07194

Report

Simplified Mamba with Disentangled Dependency Encoding for Long-Term Time Series Forecasting

Authors: Weng, Zixuan; Han, Jindong; Jiang, Wenzhao; Liu, Hao

Recent advances in deep learning have led to the development of numerous models for Long-term Time Series Forecasting (LTSF). However, most approaches still struggle to comprehensively capture reliable and informative dependencies inherent in time se

External link: http://arxiv.org/abs/2408.12068
