A Multiview Approach to Learning Articulated Motion Models
Author: Thomas M. Howard, Matthew R. Walter, Andrea F. Daniele
Year of publication: 2019
Subject: Computer science; 02 engineering and technology; Kinematics; Object (computer science); Motion (physics); Multimodal learning; 03 medical and health sciences; 0302 clinical medicine; Human–computer interaction; Feature (computer vision); 030221 ophthalmology & optometry; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Graphical model; Representation (mathematics); Natural language
Source: Springer Proceedings in Advanced Robotics, ISBN 9783030286187 (ISRR)
Description: In order for robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common within environments built for and by humans. Kinematic models provide a concise representation of these objects that enables deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely on visual observations of an object’s motion and are therefore sensitive to occlusions and feature sparsity. Natural language descriptions provide a flexible and efficient means by which humans can convey complementary information in a weakly supervised manner suitable for a variety of interactions (e.g., demonstrations and remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model linguistic information using a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multiple-part objects on which the previous state-of-the-art, vision-only system fails. We evaluate our multimodal learning framework on a dataset composed of a variety of household objects and demonstrate a 23% improvement in model accuracy over the vision-only baseline.
Database: OpenAIRE
External link:
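
To give a concrete feel for the vision–language fusion the description refers to, the following is a minimal Python sketch, not the authors' implementation. It assumes hypothetical per-model fitting residuals from a visual pipeline and a toy keyword-based grounding of the language signal; the paper itself uses a probabilistic graphical model over both modalities, so every name, prior, and score below is an illustrative assumption.

```python
# Illustrative sketch only: fuse a vision-based likelihood over candidate
# articulation types with a language-based likelihood, then pick the most
# probable model per object part. Names and scores are assumptions, not
# the authors' method.
import math
from typing import Dict

ARTICULATION_TYPES = ["rigid", "prismatic", "revolute"]

# Hypothetical keyword groundings: phrases in a narrated description that
# act as soft evidence for each articulation type.
LANGUAGE_GROUNDINGS = {
    "revolute": {"rotates", "swings", "hinge", "turns"},
    "prismatic": {"slides", "pulls out", "pushes in", "drawer"},
    "rigid": {"fixed", "does not move", "static"},
}


def vision_log_likelihood(residuals: Dict[str, float]) -> Dict[str, float]:
    """Turn per-model fitting residuals (e.g., from fitting a prismatic axis
    or a revolute hinge to tracked RGB-D features) into log-likelihoods.
    Smaller residual -> higher likelihood (simple Gaussian-style score)."""
    return {m: -0.5 * (residuals[m] ** 2) for m in ARTICULATION_TYPES}


def language_log_likelihood(description: str) -> Dict[str, float]:
    """Score each articulation type by how many of its grounded phrases
    appear in the natural language description (with log smoothing)."""
    text = description.lower()
    scores = {}
    for m in ARTICULATION_TYPES:
        hits = sum(1 for phrase in LANGUAGE_GROUNDINGS[m] if phrase in text)
        scores[m] = math.log(1 + hits)  # 0 hits -> 0, more hits -> larger
    return scores


def fuse(residuals: Dict[str, float], description: str,
         language_weight: float = 1.0) -> str:
    """Combine both observation streams and return the MAP articulation type."""
    vis = vision_log_likelihood(residuals)
    lang = language_log_likelihood(description)
    posterior = {m: vis[m] + language_weight * lang[m]
                 for m in ARTICULATION_TYPES}
    return max(posterior, key=posterior.get)


if __name__ == "__main__":
    # Vision alone is ambiguous here (similar residuals for prismatic and
    # revolute, e.g., due to occlusion); the description disambiguates.
    residuals = {"rigid": 3.0, "prismatic": 1.1, "revolute": 1.2}
    description = "The cabinet door swings open about its hinge."
    print(fuse(residuals, description))  # -> "revolute"
```

The design point the sketch mirrors is the one the abstract makes: when the visual evidence is weak or ambiguous, even coarse linguistic evidence can shift the posterior toward the correct kinematic model.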