Authors: |
Gkanatsios, Nikolaos, Pitsikalis, Vassilis, Koutras, Petros, Zlatintsi, Athanasia, Maragos, Petros |
Year of publication: |
2019 |
Subject: |
|
Document type: |
Working Paper |
Description: |
Detecting visual relationships, i.e. <Subject, Predicate, Object> triplets, is a challenging Scene Understanding task approached in the past via linguistic priors or spatial information in a single feature branch. We introduce a new deeply supervised two-branch architecture, the Multimodal Attentional Translation Embeddings, where the visual features of each branch are driven by a multimodal attentional mechanism that exploits spatio-linguistic similarities in a low-dimensional space. We present a variety of experiments comparing against all related approaches in the literature, including re-implemented and fine-tuned versions of several of them. Results on the commonly employed VRD dataset [1] show that the proposed method clearly outperforms all others, while we also justify our claims both quantitatively and qualitatively. |
Database: |
arXiv |
External link: |
|