Popis: |
Visual Language Navigation (VLN) is a grand goal of AI: it requires an agent to act according to natural-language instructions from humans. In the VLN task, the agent learns to search for a specific region described by the instructions in the training environments and then performs navigation in unseen environments. There typically exists a large domain gap between the seen and unseen environments. Numerous works have addressed this multi-task navigation setting through data augmentation and new loss designs. However, for a task that is inherently a spatial and temporal search, a valuable signal source for navigation, namely depth, has not yet been fully explored and has largely been ignored in previous efforts. In particular, current models lack the ability to capture the relative spatial directions to the grounding view. To address these issues, we propose an environment-adaptive method based on a Depth-Guided Adaptive Instance Normalization (DG-AdaIN) module that adjusts the RGB features in terms of the depth features, and we develop a shift attention module that models the relative directional information in the attention map. Extensive experiments validate the efficacy of our method on the benchmark dataset.
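
The abstract only names the DG-AdaIN module without detailing it; below is a minimal PyTorch sketch of one plausible reading, assuming (hypothetically) that per-channel scale and shift for the instance-normalized RGB features are predicted from a pooled depth feature vector, in the spirit of standard adaptive instance normalization. The class name, tensor shapes, and the pooling-plus-linear design are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of depth-guided adaptive instance normalization:
# depth features modulate (scale and shift) instance-normalized RGB features.
import torch
import torch.nn as nn


class DepthGuidedAdaIN(nn.Module):
    def __init__(self, rgb_channels: int, depth_channels: int):
        super().__init__()
        # Predict per-channel gamma and beta from the pooled depth feature.
        self.affine = nn.Linear(depth_channels, 2 * rgb_channels)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat:   (B, C_rgb, H, W)  RGB feature map
        # depth_feat: (B, C_depth, H, W) depth feature map
        b, c, _, _ = rgb_feat.shape

        # Instance-normalize the RGB features (per sample, per channel).
        mean = rgb_feat.mean(dim=(2, 3), keepdim=True)
        std = rgb_feat.std(dim=(2, 3), keepdim=True) + 1e-5
        normalized = (rgb_feat - mean) / std

        # Pool the depth features and predict the modulation parameters.
        pooled_depth = depth_feat.mean(dim=(2, 3))            # (B, C_depth)
        gamma, beta = self.affine(pooled_depth).chunk(2, dim=1)
        gamma = gamma.view(b, c, 1, 1)
        beta = beta.view(b, c, 1, 1)

        # Re-scale and shift the RGB features in terms of the depth features.
        return gamma * normalized + beta


# Usage example with random tensors.
if __name__ == "__main__":
    module = DepthGuidedAdaIN(rgb_channels=512, depth_channels=128)
    rgb = torch.randn(2, 512, 7, 7)
    depth = torch.randn(2, 128, 7, 7)
    out = module(rgb, depth)
    print(out.shape)  # torch.Size([2, 512, 7, 7])
```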