Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model

Author:	Zhao, Zhonghan; Ma, Ke; Chai, Wenhao; Wang, Xuan; Chen, Kewei; Guo, Dongxu; Zhang, Yanting; Wang, Hongwei; Wang, Gaoang
Year of publication:	2024
Subject:	Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Document type:	Working Paper
Description:	With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Multi-modal Language Models (MLMs) now integrate multi-modal signals into LLMs, bringing richer perception to embodied agents and allowing them to handle world-understanding tasks in finer detail. However, existing works: 1) rely on independently operating agents, each containing multiple LLMs, from perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data and therefore struggle with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, which limits application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 in open-ended tasks, with $1.4\times$ to $7.3\times$ gains in performance. Comment: arXiv admin note: text overlap with arXiv:2403.08282
Database:	arXiv
External link:	http://arxiv.org/abs/2404.04619