Description: |
We propose a framework for training language-invariant cross-modal retrieval models. We introduce four novel text encoding approaches, as well as a character-based word-embedding approach that allows the model to project similar words across languages into the same word-embedding space. In addition, performing cross-modal retrieval at the character level substantially reduces the storage requirements of the text encoder, allowing for lighter and more scalable retrieval architectures. The proposed character-based, language-invariant textual encoder is virtually unaffected in terms of storage requirements when new languages are added to the system. Contributions include new methods for building character-level word embeddings, an improved loss function, and a novel cross-language alignment module that not only makes the architecture language-invariant but also delivers better predictive performance. Moreover, we introduce a module called ADAPT, which provides query-aware visual representations that yield large improvements in recall on four widely used large-scale image-text datasets. We show that our models outperform the current state of the art in all scenarios. This thesis can serve as a new path for retrieval research, enabling the effective use of captions in multiple-language scenarios.
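To make the character-level idea concrete, the sketch below shows one plausible way to build word embeddings from characters. The abstract does not specify the encoder architecture, so this is a minimal sketch assuming a convolution-over-characters design; the class and parameter names (`CharWordEmbedder`, `char_dim`, `word_dim`) are hypothetical, not the thesis's actual code. The key property it illustrates is that parameters depend only on the character set, so adding a new language adds essentially no storage.

```python
# Minimal sketch of a character-level word embedder (hypothetical names;
# the thesis's exact architecture may differ). The vocabulary is the
# character set rather than a per-language word list, so adding a new
# language adds (almost) no parameters.
import torch
import torch.nn as nn

class CharWordEmbedder(nn.Module):
    def __init__(self, n_chars=256, char_dim=32, word_dim=300, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel, padding=kernel // 2)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, n_chars) integer character codes
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c))  # (b*w, n_chars, char_dim)
        x = self.conv(x.transpose(1, 2))            # (b*w, word_dim, n_chars)
        x = x.max(dim=2).values                     # max-pool over characters
        return x.view(b, w, -1)                     # (batch, n_words, word_dim)

# Words with similar surface forms across languages (e.g. "nation" /
# "nacion") share characters, so their vectors land close together in
# the shared word-embedding space.
```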
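The query-aware visual representations attributed to ADAPT can likewise be sketched. Assuming a feature-wise modulation scheme in which the caption embedding produces per-channel scale and shift vectors applied to image region features before pooling (the thesis's exact formulation may differ, and `QueryAwarePooling` is a hypothetical name), the pooled visual vector becomes a function of the query rather than a fixed image descriptor:

```python
# Minimal sketch of query-aware visual pooling in the spirit of ADAPT
# (hypothetical names; an assumption about the formulation, not the
# thesis's actual code). The caption embedding modulates image region
# features channel-wise before they are pooled.
import torch
import torch.nn as nn

class QueryAwarePooling(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.scale = nn.Linear(dim, dim)  # gamma(t): per-channel scaling
        self.shift = nn.Linear(dim, dim)  # beta(t): per-channel shifting

    def forward(self, regions, query):
        # regions: (batch, n_regions, dim) image region features
        # query:   (batch, dim) caption embedding
        gamma = self.scale(query).unsqueeze(1)  # (batch, 1, dim)
        beta = self.shift(query).unsqueeze(1)   # (batch, 1, dim)
        adapted = gamma * regions + beta        # query-conditioned features
        return adapted.max(dim=1).values        # one visual vector per pair
```

Because each caption re-weights the visual channels before pooling, the same image yields different embeddings for different queries, which is what allows the retrieval scores to improve.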