Recurrent Attention Network with Reinforced Generator for Visual Dialog

Author: Linchao Zhu, Yi Yang, Hehe Fan, Fei Wu
Year of publication: 2020
Source: ACM Transactions on Multimedia Computing, Communications, and Applications. 16:1-16
ISSN: 1551-6865, 1551-6857
DOI: 10.1145/3390891
Description: In Visual Dialog, an agent has to parse the temporal context in the dialog history and the spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend to “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Since the question and the image are usually complex, it is difficult to ground the question with a single glimpse; the attention processor therefore attends to the image multiple times to collect visual information more thoroughly. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm and thus lacks sentence-level supervision. To ameliorate this problem, we propose to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a set of candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
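The following PyTorch sketch is not the authors' code; it is a minimal illustration, under assumed module names and dimensions, of the two ideas in the abstract: attending to image region features over several glimpses conditioned on a dialog-history state, and rewarding the generator at the sentence level with the discriminator's score via REINFORCE. The reward tensor, the mean baseline, and all layer sizes are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGlimpseAttention(nn.Module):
    """Attend to image regions several times, refining the query after each glimpse."""

    def __init__(self, query_dim, region_dim, hidden_dim, num_glimpses=3):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        # GRU cell that folds the attended context back into the query state
        self.update = nn.GRUCell(region_dim, query_dim)

    def forward(self, query, regions):
        # query:   (batch, query_dim)            fused question + dialog-history state
        # regions: (batch, num_regions, region_dim) image region features
        for _ in range(self.num_glimpses):
            joint = torch.tanh(self.region_proj(regions) + self.query_proj(query).unsqueeze(1))
            weights = F.softmax(self.score(joint).squeeze(-1), dim=-1)       # (batch, num_regions)
            context = torch.bmm(weights.unsqueeze(1), regions).squeeze(1)    # (batch, region_dim)
            query = self.update(context, query)  # shift attention for the next glimpse
        return query  # grounded representation after the final glimpse


def reinforce_generator_loss(log_probs, rewards):
    """Sentence-level REINFORCE loss for the generator G.

    log_probs: (batch, seq_len) log-probabilities of the sampled answer tokens under G
    rewards:   (batch,) sentence-level reward, e.g. D's score for the sampled answer
    """
    baseline = rewards.mean()                 # simple variance-reduction baseline (an assumption)
    advantage = (rewards - baseline).detach()
    return -(log_probs.sum(dim=1) * advantage).mean()
```

In this sketch the discriminator only supplies the scalar reward for each sampled answer, so the generator receives a sentence-level training signal in addition to the usual word-by-word cross-entropy.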
Database: OpenAIRE