© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

People with blindness and low vision (pBLV) encounter substantial challenges in comprehensive scene recognition and precise object identification in unfamiliar environments. Moreover, because of their vision loss, pBLV have difficulty independently detecting and identifying potential tripping hazards. Previous assistive technologies for the visually impaired often struggle in real-world scenarios because they require constant retraining and lack robustness, which limits their effectiveness, especially in dynamic and unfamiliar environments where accurate and efficient perception is crucial. We therefore frame the research question of this paper as follows: how can we assist pBLV in recognizing scenes, identifying objects, and detecting potential tripping hazards in unfamiliar environments, where existing assistive technologies often falter due to their lack of robustness? We hypothesize that, by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses these challenges. Motivated by the growing adoption of large pretrained foundation models, particularly in assistive robotics, owing to the accurate perception and robust contextual understanding that extensive pretraining confers in real-world scenarios, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and warnings about potential risks. Specifically, our method first applies a large image-tagging model (i.e., the Recognize Anything Model (RAM)) to identify all common objects present in the captured images. The recognition results and the user query are then integrated, via prompt engineering, into a prompt tailored specifically for pBLV. By combining the prompt and the input image, a vision-language foundation model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing environmental objects and scene landmarks relevant to the prompt. We evaluate our approach through experiments on both indoor and outdoor datasets. The results demonstrate that our method recognizes objects accurately and provides insightful descriptions and analyses of the environment for pBLV.
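The prompt-engineering step described in the abstract can be sketched as follows. This is a minimal illustration only: the function name `build_pblv_prompt` and the prompt template are assumptions for demonstration, not the authors' actual wording, and the tag list stands in for the output of a tagging model such as RAM before the prompt is passed, together with the image, to a vision-language model such as InstructBLIP.

```python
def build_pblv_prompt(tags, user_query):
    """Combine image tags (e.g., from a tagging model such as RAM) with the
    user's query into a prompt tailored for blind and low-vision users.
    The template below is an illustrative assumption, not the paper's exact prompt."""
    tag_list = ", ".join(tags)
    return (
        "You are assisting a blind or low-vision user. "
        f"Objects detected in the image: {tag_list}. "
        "Describe the scene in detail and explicitly warn about any "
        "potential tripping hazards or risks. "
        f"User question: {user_query}"
    )

# Example: tags as a tagging model might return them for a street scene.
prompt = build_pblv_prompt(["sidewalk", "bicycle", "curb"],
                           "Is it safe to walk ahead?")
print(prompt)
```

In this sketch the detected objects are injected directly into the instruction so that the downstream vision-language model is steered toward hazard-aware, pBLV-oriented descriptions rather than generic captions.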

Details

Title
A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction
Author
Yu, Hao 1; Yang, Fan 1; Huang, Hao 1; Yuan, Shuaihang 1; Rangan, Sundeep 1; Rizzo, John-Ross 2; Wang, Yao 1; Fang, Yi 3

1 Tandon School of Engineering, New York University, Brooklyn, NY 11201, USA; [email protected] (Y.H.); [email protected] (F.Y.); [email protected] (H.H.); [email protected] (S.Y.); [email protected] (S.R.); [email protected] (J.-R.R.); [email protected] (Y.W.)
2 NYU Langone Health, New York University, New York, NY 10016, USA
3 Electrical Engineering and Center for Artificial Intelligence and Robotics, New York University Abu Dhabi, Abu Dhabi 129188, United Arab Emirates
First page
103
Publication year
2024
Publication date
2024
Publisher
MDPI AG
e-ISSN
2313-433X
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3059445352