Showing 1 - 10 of 4,154 results for the search: '"TAN, XIN"'
Instruction tuning guides the Multimodal Large Language Models (MLLMs) in aligning different modalities by designing text instructions, which seems to be an essential technique to enhance the capabilities and controllability of foundation models. In…
External link:
http://arxiv.org/abs/2410.10868
We propose DrivingForward, a feed-forward Gaussian Splatting model that reconstructs driving scenes from flexible surround-view input. Driving scene images from vehicle-mounted cameras are typically sparse, with limited overlap, and the movement of t…
External link:
http://arxiv.org/abs/2409.12753
Author:
Tan, Xin, Zhao, Meng
Accurate prediction of traffic accidents across different times and regions is vital for public safety. However, existing methods face two key challenges: 1) Generalization: Current models rely heavily on manually constructed multi-view structures, l…
External link:
http://arxiv.org/abs/2409.05933
Author:
Jin, Yizhang, Li, Jian, Zhang, Jiangning, Hu, Jianlong, Gan, Zhenye, Tan, Xin, Liu, Yong, Wang, Yabiao, Wang, Chengjie, Ma, Lizhuang
Visual Spatial Description (VSD) aims to generate texts that describe the spatial relationships between objects within images. Traditional visual spatial relationship classification (VSRC) methods typically output the spatial relationship between two…
External link:
http://arxiv.org/abs/2408.04957
Author:
Zhao, Zhen, Tang, Jingqun, Wu, Binghong, Lin, Chunhui, Wei, Shu, Liu, Hao, Tan, Xin, Zhang, Zhizhong, Huang, Can, Xie, Yuan
In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inher…
External link:
http://arxiv.org/abs/2407.16364
Unsupervised visible infrared person re-identification (USVI-ReID) is a challenging retrieval task that aims to retrieve cross-modality pedestrian images without using any label information. In this task, the large cross-modality variance makes it di…
External link:
http://arxiv.org/abs/2407.12758
Author:
Lian, Xiaoli, Wang, Shuaisong, Ma, Jieping, Liu, Fang, Tan, Xin, Zhang, Li, Shi, Lin, Gao, Cuiyun
Code generation, the task of producing source code from prompts, has seen significant advancements with the advent of pre-trained large language models (PLMs). Despite these achievements, there lacks a comprehensive taxonomy of weaknesses about the b…
External link:
http://arxiv.org/abs/2407.09793
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications. However, two issues widely exist in this framework: 1) Solely keyframes are used for training. For example, in nuScenes, a subs…
External link:
http://arxiv.org/abs/2407.07465
Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing fra…
External link:
http://arxiv.org/abs/2407.00326
Night-time scene parsing aims to extract pixel-level semantic information in night images, aiding downstream tasks in understanding scene object distribution. Due to limited labeled night image datasets, unsupervised domain adaptation (UDA) has becom…
External link:
http://arxiv.org/abs/2406.10531