Situated and Interactive Multimodal Conversations
| Author: | Shivani Poddar, Theodore Levin, Paul A. Crook, David Whitney, Satwik Kottur, Seungwhan Moon, Ankita De, Ahmad Beirami, Rajen Subba, Eunjoon Cho, Daniel Difranco, Alborz Geramifard |
| --- | --- |
| Year of publication: | 2020 |
| Subject: | FOS: Computer and information sciences; Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Information interfaces and presentation (e.g., HCI); Computer science; Human–computer interaction; Coreference; Context (language use); Situated; Benchmark (computing); Dialog box; Set (psychology); Utterance |
| Source: | COLING |
| DOI: | 10.18653/v1/2020.coling-main.96 |
| Description: | Next-generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision and memories of previous interactions, in addition to the user's utterances) and to perform multimodal actions (e.g., displaying a route in addition to generating the system's utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ~13K human-human dialogs (~169K utterances), collected using a multimodal Wizard-of-Oz (WoZ) setup, in two shopping domains: (a) furniture (grounded in a shared virtual environment) and (b) fashion (grounded in an evolving set of images). We also provide logs of the items appearing in each scene, along with contextual NLU and coreference annotations, using a novel and unified framework of SIMMC conversational acts for both user and assistant utterances. Finally, we present several tasks within SIMMC as objective evaluation protocols, such as Structural API Prediction and Response Generation. We benchmark a collection of existing models on these SIMMC tasks as strong baselines, and demonstrate rich multimodal conversational interactions. Our data, annotations, code, and models are publicly available (a minimal, illustrative data-loading sketch follows this record). Comment: 20 pages, 5 figures, 11 tables, accepted to COLING 2020 |
| Database: | OpenAIRE |
| External link: | |
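
The description characterizes SIMMC dialogs as paired user/assistant utterances with contextual NLU, coreference, and conversational-act annotations. The sketch below illustrates how one might iterate over a SIMMC-style dialog JSON file to pair utterances with their annotations; the file name and field names (`dialogue_data`, `dialogue`, `transcript`, `system_transcript`, `transcript_annotated`) are assumptions about the released schema, not a confirmed API.

```python
import json

def iter_turns(path):
    """Yield (user_utterance, assistant_utterance, annotation) triples
    from a SIMMC-style dialog dump. Field names are hypothetical stand-ins
    for whatever schema the released JSON actually uses."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for dialog in data.get("dialogue_data", []):
        for turn in dialog.get("dialogue", []):
            yield (
                turn.get("transcript"),            # user utterance
                turn.get("system_transcript"),     # assistant utterance
                turn.get("transcript_annotated"),  # NLU / conversational-act annotation
            )

if __name__ == "__main__":
    # Hypothetical file name; substitute the actual fashion or furniture split.
    for user, assistant, acts in iter_turns("simmc_fashion_train.json"):
        print("USER     :", user)
        print("ASSISTANT:", assistant)
        print("ACTS     :", acts)
        break  # show only the first turn
```

A loop like this is only a starting point; the benchmark tasks mentioned above (Structural API Prediction, Response Generation) would additionally consume the per-scene item logs and multimodal context that the datasets provide.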