Search results

Report

An Empirical Study of Mamba-based Language Models

Authors: Waleffe, Roger, Byeon, Wonmin, Riach, Duncan, Norick, Brandon, Korthikanti, Vijay, Dao, Tri, Gu, Albert, Hatamizadeh, Ali, Singh, Sudhakar, Narayanan, Deepak, Kulshreshtha, Garvit, Singh, Vartika, Casper, Jared, Kautz, Jan, Shoeybi, Mohammad, Catanzaro, Bryan

Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent

External link: http://arxiv.org/abs/2406.07887

Report

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Authors: Li, Zhenxin, Li, Kailin, Wang, Shihao, Lan, Shiyi, Yu, Zhiding, Ji, Yishen, Li, Zhiqi, Zhu, Ziyue, Kautz, Jan, Wu, Zuxuan, Jiang, Yu-Gang, Alvarez, Jose M.

We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn

External link: http://arxiv.org/abs/2406.06978

Report

Flextron: Many-in-One Flexible Large Language Model

Authors: Cai, Ruisi, Muralidharan, Saurav, Heinrich, Greg, Yin, Hongxu, Wang, Zhangyang, Kautz, Jan, Molchanov, Pavlo

Training modern LLMs is extremely resource intensive, and customizing them for various deployment scenarios characterized by limited compute and memory resources through repeated training is impractical. In this paper, we introduce Flextron, a networ

External link: http://arxiv.org/abs/2406.10260

Report

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Authors: Xu, Dejia, Nie, Weili, Liu, Chao, Liu, Sifei, Kautz, Jan, Wang, Zhangyang, Vahdat, Arash

Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, l

External link: http://arxiv.org/abs/2406.02509

Report

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model

Authors: Cheng, An-Chieh, Yin, Hongxu, Fu, Yang, Guo, Qiushan, Yang, Ruihan, Kautz, Jan, Wang, Xiaolong, Liu, Sifei

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhanc

External link: http://arxiv.org/abs/2406.01584

Report

X-VILA: Cross-Modality Alignment for Large Language Model

Authors: Ye, Hanrong, Huang, De-An, Lu, Yao, Yu, Zhiding, Ping, Wei, Tao, Andrew, Kautz, Jan, Han, Song, Xu, Dan, Molchanov, Pavlo, Yin, Hongxu

We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LL

External link: http://arxiv.org/abs/2405.19335

Report

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

Authors: Wang, Shihao, Yu, Zhiding, Jiang, Xiaohui, Lan, Shiyi, Shi, Min, Chang, Nadine, Kautz, Jan, Li, Ying, Alvarez, Jose M.

The advances in multimodal large language models (MLLMs) have led to growing interests in LLM-based autonomous driving agents to leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved

External link: http://arxiv.org/abs/2405.01533

Report

LITA: Language Instructed Temporal-Localization Assistant

Authors: Huang, De-An, Liao, Shijia, Radhakrishnan, Subhashree, Yin, Hongxu, Molchanov, Pavlo, Yu, Zhiding, Kautz, Jan

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal localization. The

External link: http://arxiv.org/abs/2403.19046

Report

FoVA-Depth: Field-of-View Agnostic Depth Estimation for Cross-Dataset Generalization

Authors: Lichy, Daniel, Su, Hang, Badki, Abhishek, Kautz, Jan, Gallo, Orazio

Wide field-of-view (FoV) cameras efficiently capture large portions of the scene, which makes them attractive in multiple domains, such as automotive and robotics. For such applications, estimating depth from multiple images is a critical task, and t

External link: http://arxiv.org/abs/2401.13786

Report

GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning

Authors: Yuan, Ye, Li, Xueting, Huang, Yangyi, De Mello, Shalini, Nagano, Koki, Kautz, Jan, Iqbal, Umar

Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper, we seek to leverage Gaussian splatting to generate realistic animatable avatar

External link: http://arxiv.org/abs/2312.11461
