Multimodal Transformer for Unaligned Multimodal Language Sequences
Author: | Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov |
Language: | English |
Year of publication: | 2019 |
Subject: | FOS: Computer and information sciences; Computer Science - Computation and Language (cs.CL); Crossmodal; Crossmodal attention; Computer science; Speech recognition; Natural language; Transformer (machine learning model); Gesture; Article; 020206 networking & telecommunications; 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing |
Source: | Proc Conf Assoc Comput Linguist Meet ACL (1) |
Description: | Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges exist in modeling such multimodal human language time-series data: 1) inherent data non-alignment due to variable sampling rates of the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and unaligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT is able to capture correlated crossmodal signals. |
Database: | OpenAIRE |
External link: |
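To make the directional pairwise crossmodal attention described above concrete, below is a minimal PyTorch sketch of a single block in which a target modality attends to a source modality (queries from the target, keys/values from the source). The module name, dimensions, layer-norm placement, and example sequence lengths are illustrative assumptions for this sketch, not the authors' reference implementation; the actual MulT stacks several such layers per modality pair and adds positional embeddings and feed-forward sublayers.

```python
import torch
import torch.nn as nn


class CrossmodalAttentionBlock(nn.Module):
    """Sketch of one directional crossmodal attention block.

    The target modality provides the queries; the source modality provides
    the keys and values, so the source stream is latently adapted toward the
    target across distinct (possibly unaligned) time steps.
    Hyperparameters here are illustrative, not the paper's settings.
    """

    def __init__(self, d_model: int = 40, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)  # expects (T, batch, d_model)
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (T_tgt, batch, d_model); source: (T_src, batch, d_model)
        # No alignment between T_tgt and T_src is required.
        q = self.norm_q(target)
        kv = self.norm_kv(source)
        out, _ = self.attn(q, kv, kv)
        return target + out  # residual connection keeps the target stream


# Usage example with made-up lengths: a language sequence of 50 steps
# attending to an audio sequence of 375 steps (different sampling rates).
lang = torch.randn(50, 2, 40)
audio = torch.randn(375, 2, 40)
block = CrossmodalAttentionBlock()
fused = block(lang, audio)  # shape: (50, 2, 40)
```

Because the output keeps the target modality's length, such blocks can be applied in both directions for every modality pair (e.g. language→audio and audio→language) and their outputs concatenated downstream, which mirrors the pairwise, directional design described in the abstract.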