Showing 1 - 10 of 7,426 for search: '"An, Wenzhao"'
Author:
Chen, Anthony, Xu, Jianjin, Zheng, Wenzhao, Dai, Gaole, Wang, Yida, Zhang, Renrui, Wang, Haofan, Zhang, Shanghang
Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models
External link:
http://arxiv.org/abs/2411.02395
Recent advances in high-definition (HD) map construction from surround-view images have highlighted their cost-effectiveness in deployment. However, prevailing techniques often fall short in accurately extracting and utilizing road features, as well
External link:
http://arxiv.org/abs/2411.01408
Author:
Fei, Xin, Zheng, Wenzhao, Duan, Yueqi, Zhan, Wei, Tomizuka, Masayoshi, Keutzer, Kurt, Lu, Jiwen
We propose PixelGaussian, an efficient feed-forward framework for learning generalizable 3D Gaussian reconstruction from arbitrary views. Most existing methods rely on uniform pixel-wise Gaussian representations, which learn a fixed number of 3D Gaus
External link:
http://arxiv.org/abs/2410.18979
Vision-centric autonomous driving has demonstrated excellent performance with economical sensors. As the fundamental step, 3D perception aims to infer 3D information from 2D images based on 3D-2D projection. This makes driving perception models susce
External link:
http://arxiv.org/abs/2410.13864
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D ima
External link:
http://arxiv.org/abs/2410.10382
Vision mambas have demonstrated strong performance with linear complexity to the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and the
External link:
http://arxiv.org/abs/2410.10316
Author:
Zhang, Yuan, Fan, Chun-Kai, Ma, Junpeng, Zheng, Wenzhao, Huang, Tao, Cheng, Kuan, Gudovskiy, Denis, Okuno, Tomoyuki, Nakata, Yohei, Keutzer, Kurt, Zhang, Shanghang
In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redunda
External link:
http://arxiv.org/abs/2410.04417
Author:
Liu, Litao, Wang, Wentao, Han, Yifan, Xie, Zhuoli, Yi, Pengfei, Li, Junyan, Qin, Yi, Lian, Wenzhao
Multi-task imitation learning (MTIL) has shown significant potential in robotic manipulation by enabling agents to perform various tasks using a unified policy. This simplifies the policy deployment and enhances the agent's adaptability across differ
External link:
http://arxiv.org/abs/2409.19528
This report presents the approach adopted in the Modelscope-Sora challenge, which focuses on fine-tuning data for video generation models. The challenge evaluates participants' ability to analyze, clean, and generate high-quality datasets for video-b
External link:
http://arxiv.org/abs/2410.07194
Recent advances in deep learning have led to the development of numerous models for Long-term Time Series Forecasting (LTSF). However, most approaches still struggle to comprehensively capture reliable and informative dependencies inherent in time se
External link:
http://arxiv.org/abs/2408.12068