Popis: |
Humans can process multiple perspectives of the world and combine them into a richer understanding. In machine learning, generative models are widely used to comprehend the world by learning latent representations of the given data. Representation learning is a key step in data understanding, where the goal is to distill interpretable factors associated with the data. However, most representation learning approaches focus on data observed in a single modality, such as text, images, or video. In this dissertation, we develop generative models that fuse multimodal data in order to understand it comprehensively.

First, we introduce a model based on the VAE framework that learns a latent representation from multimodal data. Specifically, we disentangle the latent space into a modality-common (shared) space and modality-specific (private) spaces. By modeling the private latent factors alongside the latent factors shared across all modalities, the proposed model achieves more precise cross-modal generation and retrieval from one modality to another than models based only on a shared latent space. We demonstrate that our model can also be applied to semi-supervised learning and zero-shot learning problems.

We further study the importance of perceiving the world from multiple views through trajectory forecasting scenarios. The growing demand for autonomous vehicles is spurring numerous studies of behavior prediction. Existing works give insufficient attention to context, such as the nearby environment and neighboring agents, when predicting a target's future trajectories. We propose a Conditional Multimodal VAE that generates trajectory predictions conditioned on this multimodal context. We show that the proposed model produces predictions that avoid collisions with the surrounding environment and neighbors on both real and simulated data.
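To make the shared/private factorization described above more concrete, the following is a minimal sketch of a two-modality VAE whose latent space is split into a modality-common part and per-modality private parts. All module names, layer sizes, and the mean-based fusion of the shared posterior are illustrative assumptions, not the dissertation's actual architecture or fusion rule.

# Illustrative sketch only: a two-modality VAE with a shared (modality-common)
# latent and one private (modality-specific) latent per modality.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps one modality to Gaussian parameters for its shared and private latents."""
    def __init__(self, x_dim, shared_dim, private_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.shared_mu = nn.Linear(hidden, shared_dim)
        self.shared_logvar = nn.Linear(hidden, shared_dim)
        self.private_mu = nn.Linear(hidden, private_dim)
        self.private_logvar = nn.Linear(hidden, private_dim)

    def forward(self, x):
        h = self.net(x)
        return (self.shared_mu(h), self.shared_logvar(h),
                self.private_mu(h), self.private_logvar(h))

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class MultimodalVAE(nn.Module):
    def __init__(self, dims=(784, 300), shared_dim=32, private_dim=16):
        super().__init__()
        self.encoders = nn.ModuleList(
            [Encoder(d, shared_dim, private_dim) for d in dims])
        # One decoder per modality; each conditions on the shared latent
        # plus that modality's private latent.
        self.decoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(shared_dim + private_dim, 256),
                           nn.ReLU(), nn.Linear(256, d)) for d in dims])

    def forward(self, xs):
        stats = [enc(x) for enc, x in zip(self.encoders, xs)]
        # Average the per-modality shared posteriors (a simple stand-in for
        # a product-of-experts style fusion).
        shared_mu = torch.stack([s[0] for s in stats]).mean(0)
        shared_logvar = torch.stack([s[1] for s in stats]).mean(0)
        z_shared = reparameterize(shared_mu, shared_logvar)
        recons = []
        for (_, _, p_mu, p_logvar), dec in zip(stats, self.decoders):
            z_private = reparameterize(p_mu, p_logvar)
            recons.append(dec(torch.cat([z_shared, z_private], dim=-1)))
        return recons

Under this factorization, cross-modal generation amounts to encoding modality A to obtain the shared latent, sampling the target modality's private latent from its prior, and decoding with the target modality's decoder.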
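The trajectory-forecasting part can be read in the same spirit: a conditional VAE whose prior, posterior, and decoder are conditioned on multimodal context. The sketch below assumes the context is already encoded into a past-trajectory feature, a map (environment) feature, and a neighbor feature; the class name, dimensions, and conditioning scheme are hypothetical and only illustrate the general idea, not the proposed model.

# Illustrative sketch only: a conditional VAE that predicts a future trajectory
# given multimodal context (encoded past track, map crop, and neighbor tracks).
import torch
import torch.nn as nn

class ConditionalTrajectoryVAE(nn.Module):
    def __init__(self, past_dim=16, map_dim=64, nbr_dim=32, z_dim=16, horizon=12):
        super().__init__()
        ctx = past_dim + map_dim + nbr_dim
        self.posterior = nn.Linear(ctx + horizon * 2, 2 * z_dim)  # q(z | context, future)
        self.prior = nn.Linear(ctx, 2 * z_dim)                    # p(z | context)
        self.decoder = nn.Sequential(nn.Linear(ctx + z_dim, 128), nn.ReLU(),
                                     nn.Linear(128, horizon * 2))  # future (x, y) offsets

    def forward(self, past, map_feat, nbr_feat, future=None):
        # Fuse the multimodal context; use the posterior during training
        # (when the ground-truth future is available) and the prior at test time.
        ctx = torch.cat([past, map_feat, nbr_feat], dim=-1)
        params = (self.posterior(torch.cat([ctx, future], dim=-1))
                  if future is not None else self.prior(ctx))
        mu, logvar = params.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(torch.cat([ctx, z], dim=-1)), mu, logvar

Sampling several z values from the context-conditioned prior yields a set of diverse future trajectories, each of which is informed by the environment and neighboring agents encoded in the context.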