
Abstract

In the field of embodied AI, vision-and-language navigation (VLN) is a crucial and challenging multi-modal task. Specifically, outdoor VLN requires an agent to navigate a graph-based environment while simultaneously interpreting real-world urban scenes and natural language instructions. Existing outdoor VLN models predict actions from a combination of panorama and instruction features. However, these methods can leave the agent struggling to understand complicated outdoor environments, overlooking environmental details and ultimately failing to navigate. When navigating to an unfamiliar place, humans often rely on specific objects as reference landmarks, a more rational and efficient strategy. Inspired by this natural human behavior, we propose an object-level alignment module (OAlM) that guides the agent to attend to object tokens mentioned in the instructions and to recognize the corresponding landmarks during navigation. By treating these landmarks as sub-goals, our method effectively decomposes a long-range path into a series of shorter paths, improving the agent’s overall performance. Beyond better object recognition and alignment, OAlM also fosters a more robust and adaptable agent capable of navigating complex environments, an adaptability that is particularly crucial for real-world applications with unpredictable and varied conditions. Experimental results show that OAlM is a more object-focused model and that it outperforms the baseline on all metrics on the challenging outdoor VLN dataset Touchdown, exceeding the baseline by 3.19% on task completion (TC). These results highlight the potential of leveraging object-level information, in the form of sub-goals, to improve navigation performance in embodied AI systems, paving the way for more advanced and efficient outdoor navigation.
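
The record contains no code, but the sub-goal idea in the abstract can be illustrated with a minimal sketch. The PyTorch snippet below shows one plausible way to align instruction object tokens with detected panorama objects and derive a sub-goal signal; all names (`ObjectAlignment`, `obj_tok_emb`, `pano_obj_feats`), the dimensions, and the cosine-similarity formulation are illustrative assumptions, not the authors' actual OAlM implementation.

```python
# Minimal sketch of an object-level alignment step in the spirit of OAlM.
# Hypothetical names and dimensions; the paper's architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectAlignment(nn.Module):
    """Scores how well object tokens from the instruction match
    object features detected in the current panorama."""

    def __init__(self, txt_dim: int, img_dim: int, hid_dim: int = 256):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, hid_dim)  # project object-token embeddings
        self.img_proj = nn.Linear(img_dim, hid_dim)  # project panorama object features

    def forward(self, obj_tok_emb, pano_obj_feats):
        # obj_tok_emb:    (num_obj_tokens, txt_dim) embeddings of object words
        # pano_obj_feats: (num_regions, img_dim) features of detected objects
        q = F.normalize(self.txt_proj(obj_tok_emb), dim=-1)
        k = F.normalize(self.img_proj(pano_obj_feats), dim=-1)
        sim = q @ k.t()              # (num_obj_tokens, num_regions) cosine similarity
        align = sim.softmax(dim=-1)  # attention of each object token over regions
        # Max similarity per token: a high value suggests the mentioned
        # landmark is visible, i.e., a sub-goal may have been reached.
        subgoal_score = sim.max(dim=-1).values
        return align, subgoal_score

# Example usage with random stand-in features.
align_mod = ObjectAlignment(txt_dim=768, img_dim=2048)
align, score = align_mod(torch.randn(4, 768), torch.randn(36, 2048))
```

In this sketch, a high `subgoal_score` for the currently targeted landmark token would signal that the landmark is visible, letting the agent advance to the next instruction segment; this reflects the sub-goal decomposition described in the abstract, not the paper's exact mechanism.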

Details

Title
Outdoor Vision-and-Language Navigation Needs Object-Level Alignment
Author
Sun, Yanjun 1; Qiu, Yue 2; Aoki, Yoshimitsu 3; Kataoka, Hirokatsu 2

1 Department of Electronics and Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan; [email protected]
2 National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan; [email protected] (Y.Q.); [email protected] (H.K.)
3 Department of Electronics and Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1, Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan; [email protected]
First page
6028
Publication year
2023
Publication date
2023
Publisher
MDPI AG
e-ISSN
1424-8220
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2836484222
Copyright
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).