Multimodal story comprehension: datasets, tasks and neural models

Author: Ravi, Hareesh
Publication year: 2021
DOI: 10.7282/t3-6xj3-1n36
Description: Storytelling is a uniquely human skill that plays a central role in how we learn about and experience the world. Stories play a crucial part in the mental development of humans, as they help us encode a wide range of shared knowledge, including common-sense physics, cause and effect, human psychology, and morality. We postulate that machine intelligence requires comparable skills, particularly when interacting with people. Much of the current research in understanding the visual and textual world operates only at a superficial, factual level, using data that aligns atomic, one-to-one descriptive, factual text with an image. An ideal AI system must be able to create and comprehend multimodal narratives in a causal and coherent manner, much like humans, in order to interact with them seamlessly. This dissertation aims to bridge the gap between current research and ideal AI systems by developing novel datasets, tasks and neural methods for true multimodal story creation and comprehension. We start by highlighting the limitations of existing work, such as the factual and superficial alignment of image-text context, the lack of coherent narrative understanding, and ill-defined tasks for multimodal story comprehension. To offset these limitations, we propose a novel computational task, Story Illustration, as a measure of story comprehension by a neural model. We model textual coherence explicitly with an end-to-end trained hierarchical neural model capable of illustrating a story. Our evaluation highlights limitations of existing visual storytelling datasets. We then extend the formulation to a Many-to-Many setting to generalize the task and demand coherence modelling, also creating a new and improved dataset for multimodal story comprehension. We develop a machine-translation approach to story illustration, leveraging text generation techniques, that explicitly models visual and textual coherence in stories. Further, we develop a novel dataset, AESOP, that enables the modelling of story comprehension and creation from a truly multimodal perspective, capturing the creative process associated with storytelling. Our framework models the evolution of textual and visual concepts on AESOP using interacting sequential networks. Our contributions lay a strong foundation for developing AI systems that can create and comprehend multimodal stories and, consequently, comprehend any kind of data. This dissertation, along with all publicly released resources such as data and code, drives further research towards building complex and intelligent systems.
Database: OpenAIRE