Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Autor:	Kang, Weitai, Huang, Haifeng, Shang, Yuzhang, Shah, Mubarak, Yan, Yan
Rok vydání:	2024
Předmět:	Computer Science - Artificial Intelligence Computer Science - Computation and Language Computer Science - Computer Vision and Pattern Recognition
Druh dokumentu:	Working Paper
Popis:	Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential in building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG) engine. RIG generates two key instruction data: 1) the Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model's discriminative understanding. 2) the Diverse Instruction-following data, which contains various instruction styles to enhance model's generalization. As a result, we construct 1 million instruction-following data, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates Relation-Augmented Projector to enhance spatial understanding, and then strengthens the object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely-used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8\% improvement in the grounding task (Multi3DRefer) and a 6.9\% improvement in the captioning task (Scan2Cap). Comment: 10 pages
Databáze:	arXiv
Externí odkaz:	http://arxiv.org/abs/2410.00255 Zobrazit plný text záznamu View this record from Arxiv