Abstract
Bio-inspired visual systems have garnered significant attention in robotics owing to their energy efficiency, rapid dynamic response, and environmental adaptability. Among these, event cameras, bio-inspired sensors that asynchronously report pixel-level brightness changes called 'events', stand out because of their ability to capture dynamic changes with minimal energy consumption, making them suitable for challenging conditions such as low light or high-speed motion. However, current mapping and localization methods for event cameras depend primarily on point and line features, which struggle in sparse or low-feature environments and are unsuitable for static or slow-motion scenarios. We addressed these challenges by proposing a bio-inspired vision mapping and localization method using active LED markers (ALMs) combined with reprojection error optimization and asynchronous Kalman fusion. Our approach replaces traditional features with ALMs, thereby enabling accurate tracking under dynamic and low-feature conditions. Global mapping accuracy improved significantly through reprojection error minimization, with the corner error reduced from 16.8 cm to 3.1 cm after 400 iterations. The asynchronous Kalman fusion of multiple camera pose estimations from ALMs ensures precise localization with high temporal efficiency. The method achieved a mean translation error of 0.078 m and a rotational error of 5.411° in dynamic-motion evaluations. In addition, it supported an output rate of 4.5 kHz while maintaining high localization accuracy in UAV spiral flight experiments. These results demonstrate the potential of the proposed approach for real-time robot localization in challenging environments.
Introduction
Bio-inspired visual systems for robotics have emerged as a focused research area owing to their high energy efficiency, rapid dynamic response capabilities, and broad environmental adaptability. Biological visual systems, such as those of insects and humans, possess unique characteristics: they can perceive real-time changes in the surrounding environment with extremely low energy consumption and respond rapidly to dynamic events. This highly efficient visual mechanism offers valuable insights for the development of robotic vision systems [1]. Event cameras have recently emerged as novel bio-inspired visual sensors and gained increasing attention [2]. Unlike traditional frame-based cameras, event cameras record dynamic information in a non-uniform manner by detecting pixel-wise brightness changes in the scene (referred to as "events"). This approach not only significantly reduces data transmission and computational complexity but also maintains robust perceptual performance in challenging environments, such as low-light or high-dynamic settings.
Current mapping and localization methods for bio-inspired event cameras depend predominantly on precise point features obtained by traditional extraction [3–7] or learning-based extraction [8, 9], as well as on line features [10–15]. However, these approaches often fail to satisfy the demands of robotic applications. In particular, it is challenging to extract and track numerous features in environments with sparse or low-feature conditions (Figure 1). Furthermore, because of the inherent limitation of event cameras, which capture only brightness changes, these methods are unsuitable for static or slow-motion scenarios.
Figure 1
Comparison of our method and feature-based methods in a low-feature scene. A sufficient number of features could not be extracted by the keypoint extraction method [3], whereas the ALM features detected by our method are precise and robust
Hence, active LED markers (ALMs) have been introduced for event cameras [16–19] as a reliable and robust alternative to traditional point and line features. An ALM can be detected and tracked with high accuracy even in environments characterized by rapid motion and challenging lighting conditions. However, a single ALM is insufficient for large-scale global localization owing to the significant decay of its LED spot intensity with distance. Moreover, the spatial arrangements [18] and frequency-based methods [16, 17, 19] used to identify different ALMs cannot support the numerous ALMs required for effective mapping and localization in large-scale environments. Building on our previous approach, which achieved relative pose estimation between a single ALM and an event camera, this work presents mapping and global localization. It addresses the insufficient and unstable features extracted by current methods in sparse environments and provides refined large-scale global localization with high accuracy and temporal efficiency.
In this study, we proposed a bio-inspired vision mapping and localization method based on reprojection error optimization and asynchronous Kalman fusion. We replaced the traditional point or line features with robust and reliable ALMs to overcome the limitations of feature-based methods in sparse or low-feature environments. Global mapping was achieved with high accuracy through reprojection error minimization. Furthermore, the optimization converges rapidly owing to the high temporal efficiency of the ALM detection method. The absolute corner error of the map decreased from 16.8 cm to 3.1 cm after 400 iterations of reprojection error optimization. In addition, covariance propagation within the pose transformation chain was incorporated to maintain high localization accuracy. The asynchronous Kalman fusion integrates multiple asynchronous camera pose estimations from different ALMs at the µs level. Consequently, asynchronous global localization with high precision and temporal efficiency was achieved. We evaluated our method for dynamic motion within low-feature environments and during spiral UAV flights. Our approach outperforms the mono-event-based method EVO [20], achieving a mean translation error of 0.078 m compared to 0.323 m for EVO, and a rotational error of 5.411° versus 25.834° for EVO. The maximum estimated output rate reached 4.5 kHz in the UAV spiral flight experiment while maintaining high localization accuracy. These results demonstrate the potential of our method for robot tasks in challenging illumination and dynamic environments, as illustrated in Figure 2. The main contributions of this study are summarized as follows:
Robust ALM Feature for Sparse Environments: Traditional point and line features are replaced with robust ALMs to address the limitations of sparse or low-feature scenarios. ALMs enable reliable detection and tracking under challenging conditions (i.e., rapid motion and low light intensity), thereby achieving refined global mapping and localization.
High-Accuracy Mapping Based on Reprojection Error Minimization: Reprojection error minimization was introduced to achieve global mapping with high precision. This optimization effectively reduced the absolute corner error of the map to 3.1 cm after 400 iterations.
Microsecond-Level Localization through Asynchronous Kalman Fusion: An asynchronous Kalman filter was utilized to fuse multi-ALM pose estimates at the µs level. Our method ensures consistent localization accuracy by propagating covariance through the pose transformation chain, achieving a mean translation error of 0.078 m and an output rate up to 4.5 kHz in demanding scenarios.
Figure 2
Application of our proposed method. The UAV flies from low illumination (Phase I) through a high-dynamic-range illumination transition (Phase II) to high illumination (Phase III)
Related Work
Bio-inspired Event Camera
An event camera is a novel imaging device inspired by biological visual systems that can capture motion information in scenes with an extremely low data rate and a high dynamic range. Its working principle resembles that of the rod and cone cells in the human eye, triggering "events" in response to changes in light intensity. This enables the efficient encoding of moving objects within the scene. Unlike traditional frame-based cameras, an event camera asynchronously records the temporal evolution of the intensity change at each pixel using timestamps [2]. The encoded information is expressed as $e_k = (\mathbf{x}_k, t_k, p_k)$, representing an event captured at time $t_k$, with pixel location $\mathbf{x}_k$ and polarity $p_k$. The polarity indicates the direction of the brightness change. When the change in brightness $\Delta L(\mathbf{x}_k, t_k)$, where $L$ represents the logarithmic intensity of brightness, exceeds the threshold $C$, an event $e_k$ is recorded by the event camera and assigned to the pixel location $\mathbf{x}_k$:

$$\Delta L(\mathbf{x}_k, t_k) = L(\mathbf{x}_k, t_k) - L(\mathbf{x}_k, t_k - \Delta t_k) = p_k C, \tag{1}$$

where $t_k - \Delta t_k$ is the last timestamp at which an event was triggered at pixel $\mathbf{x}_k$. An event camera is highly advantageous for processing high-speed motion and low-light conditions and for reducing data redundancy owing to its high temporal sensing ability.

Mapping and Localization for Event Camera
Feature-based mapping and localization methods for event cameras typically involve two main stages: feature extraction and tracking, and camera tracking and mapping [21]. Robust features that are invariant to motion, noise, and lighting changes are expected during feature extraction. Better extracted features result in more refined mapping and localization under dynamic motion and challenging illumination.
The point and line features are extracted and subsequently tracked to estimate the camera poses and three-dimensional landmarks. For instance, the eHarris method [4] implements the Harris detector [5] with derivative filters, and the CHEC method [6] applies corner detection by filtering and incrementally updating the Harris score. Other approaches, such as eFAST [22] and Arc* [7], build on the FAST detector [23] to optimize detection efficiency using derivative filters and expanding circular arcs, respectively. Harris and FAST detectors have been extended to several event representations, including edge maps [24, 25] and motion-compensated event frames [26].
For instance, Ref. [25] applied Harris to edge maps, whereas Ref. [24] used FAST to select corner candidates based on the highest Shi-Tomasi score. Motion-compensated frames were generated over extended time windows to reduce motion blur before applying FAST for corner detection in Refs. [12] and [26]. However, these methods are sensitive to motion and noise. This problem can be overcome using learning-based techniques [8, 27] to model motion-variant patterns and noise. For example, Ref. [27] used a speed-invariant time surface with a random forest for corner prediction, whereas Ref. [8] employed ConvLSTM [9] to predict gradients from event data, with supervision from brightness images, before applying the Harris detector. Ref. [28] directly extracted spatiotemporal corner features by projecting asynchronous brightness change events from dual fisheye sensors, enabling real-time landmark detection for fast drone tracking. Similarly, Ref. [29] proposed a continuous-time stereo VO pipeline that preserved the native temporal resolution of asynchronous event streams by leveraging Gaussian process regression with a physically motivated prior. Ref. [30] introduced ESVIO, an event-based stereo visual–inertial odometry framework that fused asynchronous event streams, synchronized stereo images, and IMU data into a direct pipeline, delivering robust, real-time 6-DOF state estimation. Event-based stereo VO was achieved by integrating gyroscope preintegration for yaw observability and an edge-pixel sampling strategy guided by local event dynamics, yielding improved mapping and tracking [31]. ESVO2 [32] builds upon direct event-based stereo VO by embedding IMU measurements as motion priors via a compact continuous-time back-end and contour-point sampling, enabling efficient mapping and accurate pose tracking at high resolutions and in large-scale environments.
In addition, line-based features were extracted because event cameras naturally respond to moving edges and often capture abundant line features. Classical methods, such as the Hough transform and line segment detector (LSD) [10], are commonly employed. Other approaches leverage the spatiotemporal properties of the event data [33] or use external IMU data to cluster events in vertical bins [34]. For example, a study [35] applied the Hough transform by projecting events into the Hough space for line detection. The spiking Hough transformation [36] enhanced this using spiking neurons in the Hough space to identify lines while extending this technique to a 3D point map to group events into 3D lines. The EVO approach [20] presents a purely geometric method for aligning event data based on edge patterns. The camera-tracking module converted a sequence of events into an edge map, which was subsequently aligned with the reference frame built from the reprojection of the 3D map. The PL-VIO approach [14] performs line extraction by directly applying the LSD to motion-compensated event data. The event-based line segment detector (ELiSeD) [37] computes event orientations using a Sobel filter and groups events with similar angles into line support regions. Similarly, an optical flow was used to compute event orientations and assign events with similar distances and angles to line clusters [11]. In addition, a plane-fitting algorithm was used to cluster events triggered by a moving line and estimate the line via the cross product of the optical flow and the plane normal [33]. Another method [34] aligned event data with the direction of gravity and clustered the events into vertical bins to extract the vertical lines.
Moreover, recent learning-based approaches to event-camera pose estimation exploit the unique spatiotemporal characteristics of asynchronous event streams to train deep networks for robust motion and tracking tasks. A study [38] introduced PEPNet, a lightweight point-based architecture that directly ingested raw event point clouds, hierarchically extracted spatial features, and employed an attentive Bi-LSTM for explicit temporal encoding, achieving 6-DOF relocalization with minimal parameters. Another study [39] proposed ES-PTAM, a stereo visual–odometry framework combining a correspondence-free depth-mapping module based on ray-density fusion with an edge-map-alignment tracker, outperforming the existing VO systems on diverse real-world datasets. A contrast-maximization extension that learned dense optical flow, depth, and ego-motion directly from events, embedding space–time event alignment within an end-to-end optimization network was developed [40]. An event-based stereo 3D mapping and tracking pipeline for autonomous vehicles was presented by integrating a DSI-based depth-fusion module with a learned tracking network to deliver robust mapping and pose estimates in driving scenarios [41]. Another study [42] released EventVOT, the first high-resolution (1280 × 720) event-tracking benchmark, and proposed a hierarchical knowledge-distillation tracker that trains a transformer-based teacher on RGB input and distills it into an event-only student for high-speed object tracking. An event-based head-pose estimation network with novel event spatial–temporal fusion and motion–perceptual attention modules backed by two large, diverse datasets was introduced to tackle extreme motion and lighting in head-pose regression [43].
Despite these advancements, both feature-based and learning-based methods face challenges in sparse-feature scenarios, where relying on natural features becomes less effective.
Hence, we utilized ALMs to replace the natural features extracted for mapping and localization. Previous studies [44, 45] introduced active ArUco markers integrating multiple LEDs, whose blinking patterns can be detected by an event camera in complex scenarios (fast motion, low light, and high contrast). However, their output rates are restricted by the complex detection and tracking of ArUco markers. Active LED markers identified by spatial arrangements or blinking frequency [16–19] are insufficient for supporting the numerous ALMs required for effective mapping and localization in large-scale environments. Therefore, large-scale mapping and global localization with ALMs can address the limitations of current methods that struggle in sparse or low-feature environments. Although our previous work introduced binary temporal-interval encoding of a single ALM and demonstrated µs-level pose estimation with high precision, this study provides the following extensions and improvements:
The 3D positions of all ALMs and camera trajectories were jointly optimized via reprojection error minimization, yielding a globally consistent marker map.
An asynchronous Kalman filter method that integrates pose estimates from multiple ALMs in real time was proposed.
These advances enable large-scale mapping and robust global localization under sparse-feature and low-contrast conditions, and even under high-speed and low-illumination conditions. In addition, we evaluated both a handheld motion environment and high-speed spiral flight, demonstrating the potential of our proposed method for real-time robot global localization in challenging environments.
In summary, the algorithm overview according to the timeline is provided in Figure 3.
Figure 3
Overview of the algorithms arranged along a timeline. The algorithms are divided into four categories: point-based, line-based, and learning-based methods, and active LED markers, represented by blue, red, orange, and green, respectively
Method
Overview
This paper presents a framework for event-based mapping and real-time localization using ALMs, as shown in Figure 4. The proposed system operates using three tightly coupled components: marker detection, global mapping, and localization. ALMs were first detected and tracked using temporally modulated blinking patterns, enabling robust identification and 6-DoF pose estimation. A global marker map is constructed by optimizing the reprojection error across multi-view observations and leveraging camera poses and 3D marker coordinates in a bundle adjustment formulation. Finally, real-time camera localization is achieved by fusing asynchronous pose estimates derived from multiple ALMs by employing the Lie group covariance propagation and an adaptive asynchronous Kalman filtering strategy.
Figure 4
Overview of the proposed method: A global map is first refined via reprojection error minimization. Then, multiple ALM estimations are fused using our asynchronous Kalman optimization to achieve high-accuracy, microsecond (µs)-level global localization
Active LED Marker Detection
Based on our previous work, an ALM with a unique ID can be detected and tracked through temporal interval modulation between blinks. The 6-DOF pose of the ALM was obtained by solving the PnP problem with the observed pixel positions and the known 3D positions of the ALM LEDs. These observed pixel positions and 3D positions were then utilized to optimize the reprojection error and obtain a precise global map.
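The ALM decoding pipeline itself is described in our previous work and is not reproduced here; the snippet below is only a minimal sketch of the final PnP step, assuming OpenCV's `solvePnP` and hypothetical arrays for the decoded LED pixel positions and their known 3D layout on the marker (all numeric values are illustrative).

```python
import cv2
import numpy as np

# Hypothetical inputs: 3D LED positions in the ALM frame (metres) and their
# decoded 2D pixel observations from the event stream (same ordering).
led_points_marker = np.array([[0.00, 0.00, 0.0],
                              [0.05, 0.00, 0.0],
                              [0.05, 0.05, 0.0],
                              [0.00, 0.05, 0.0]], dtype=np.float64)
led_pixels = np.array([[412.3, 288.1],
                       [501.7, 290.4],
                       [499.2, 379.8],
                       [410.9, 377.5]], dtype=np.float64)

K = np.array([[900.0, 0.0, 640.0],        # assumed camera intrinsics
              [0.0, 900.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                        # assume undistorted event coordinates

# Solve the PnP problem: pose of the ALM in the camera frame.
ok, rvec, tvec = cv2.solvePnP(led_points_marker, led_pixels, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
if ok:
    R_cm, _ = cv2.Rodrigues(rvec)         # rotation: marker frame -> camera frame
    print("t_cm =", tvec.ravel())
```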
Mapping Based on Reprojection Error Minimization
Although the principle of minimizing the reprojection error is fundamental to 3D reconstruction and is a core component of bundle adjustment (BA) in traditional visual SLAM (VSLAM) [46], its application in our event-based system with ALMs presents unique characteristics and advantages. Unlike the BA in frame-based VSLAM, which typically depends on passively observed natural features, our method leverages the high temporal resolution of event data to track active signaling and uniquely identifiable ALMs. This provides sparse yet highly reliable and precise constraints for optimization, which is particularly beneficial in challenging scenarios such as high-speed motion or low-feature environments where traditional methods fail.
Global marker mapping can be estimated from the continuous 3D camera poses observed from different markers and is optimized by reprojection error minimization, as shown in Figure 5. Let $\mathcal{M}$ denote the set of detected ALMs and $\mathcal{M}_t \subseteq \mathcal{M}$ the set of markers detected at time $t$. Let $\mathbf{P}_{i,j}$ denote the 3D point of LED $i$ of ALM $j$ in the global reference system. Mapping requires at least two detected ALMs, i.e., $|\mathcal{M}_t| \geq 2$. The projection of LED $i$ of ALM $j$ can be obtained by defining the perspective projection function $\pi$ as

$$\hat{\mathbf{u}}_{i,j,t} = \pi\!\left(\mathbf{K}\,\mathbf{T}_t\,\mathbf{P}_{i,j}\right), \tag{2}$$

where $\pi(\cdot)$ denotes the perspective division onto the image plane and $\mathbf{K}$ is the camera intrinsic matrix, defined as:

$$\mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}. \tag{3}$$
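As a concrete reading of Eqs. (2) and (3), the following sketch applies a pinhole projection to a 3D LED point; the intrinsic values, pose, and point are assumed for illustration and do not come from the paper.

```python
import numpy as np

def project(K, T_cw, P_w):
    """Pinhole projection of a 3D point P_w (global frame) given the
    world-to-camera transform T_cw (4x4) and intrinsics K (3x3)."""
    P_c = T_cw[:3, :3] @ P_w + T_cw[:3, 3]   # transform into the camera frame
    uvw = K @ P_c                            # apply intrinsics
    return uvw[:2] / uvw[2]                  # perspective division

K = np.array([[900.0, 0.0, 640.0],
              [0.0, 900.0, 360.0],
              [0.0, 0.0, 1.0]])
T_cw = np.eye(4)                             # assumed camera pose at time t
P_w = np.array([0.1, -0.05, 2.0])            # an LED 2 m in front of the camera
print(project(K, T_cw, P_w))                 # predicted pixel location
```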
Figure 5
Mapping process based on reprojection error minimization. The 3D LED points $\mathbf{P}_{i,j}$ on the ALMs in the world coordinate system are projected onto the 2D image plane as observed pixel points $\mathbf{u}_{i,j,t}$, based on the camera pose $\mathbf{T}_t$ and camera intrinsics $\mathbf{K}$. The global map is optimized by minimizing the reprojection error through adjustment of the camera poses $\mathbf{T}_t$ and the 3D coordinates $\mathbf{P}_{i,j}$ of the ALMs.
A globally consistent map of ALMs was established and camera poses were refined using an optimization strategy based on minimizing the reprojection error of the observed ALM LEDs. This is conceptually similar to the BA used in VSLAM systems. However, the nature of our input data, that is, asynchronous events from precisely detected ALMs, distinguishes our approach from the BA applied to frame-based natural features.
$\mathbf{T}_t$ denotes the matrix that transforms a 3D point from the global reference system to the camera reference system at time $t$. Thus, the reprojection error $e_{j,t}$ of ALM $j$ detected at time $t$ is obtained by comparing the observed projections $\mathbf{u}_{i,j,t}$ of the ALM LEDs with the predicted projections:

$$e_{j,t} = \sum_{i} \left\| \mathbf{u}_{i,j,t} - \pi\!\left(\mathbf{K}\,\mathbf{T}_t\,\mathbf{P}_{i,j}\right) \right\|^{2}. \tag{4}$$

The total reprojection error over the whole temporal domain can be expressed as:

$$E = \sum_{t} \sum_{j \in \mathcal{M}_t} e_{j,t}. \tag{5}$$

We deployed the Levenberg–Marquardt (LM) algorithm [47] to optimize the camera poses and 3D marker coordinates by minimizing the total reprojection error $E$. Each optimization step is computed as follows:

$$\Delta\boldsymbol{\theta}_k = -\left(\mathbf{J}^{\top}\mathbf{J} + \lambda_k \mathbf{I}\right)^{-1}\mathbf{J}^{\top}\mathbf{e}, \tag{6}$$

where $\mathbf{J}$ is the Jacobian of the stacked residual vector $\mathbf{e}$, $\lambda_k$ is a nonnegative scalar, and $\mathbf{I}$ is the identity matrix. $\lambda_k$ is dynamically updated at each iteration $k$, as shown in Algorithm 1.
Algorithm 1
Mapping based on reprojection error minimization
The efficacy of this optimization stems directly from the characteristics of event-based ALM detection. The high temporal resolution and asynchronous nature of event data allow ALM LED signals to be captured with little motion blur and high precision, even during rapid sensor movements. Furthermore, the unique temporal encoding of ALMs ensures robust identification and provides consistent, unambiguous data points $(\mathbf{u}_{i,j,t}, \mathbf{P}_{i,j})$ as inputs for the optimization. This contrasts with traditional BA, which must contend with potential ambiguities, outliers, and illumination sensitivity associated with natural feature extraction from intensity frames. Consequently, our reprojection optimization benefits from high-quality, high-frequency constraints, resulting in efficient convergence and accurate map estimation.
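Algorithm 1 is not reproduced verbatim here; the sketch below illustrates one way to implement the damped update of Eq. (6) with a dynamically adjusted factor λ, written against a generic stacked-residual callable (in the actual mapping problem, the parameter vector would concatenate the camera poses and the 3D ALM coordinates, and the residuals would be the reprojection errors of Eq. (4)).

```python
import numpy as np

def numerical_jacobian(residuals, theta, eps=1e-6):
    """Finite-difference Jacobian of the stacked residual vector."""
    r0 = residuals(theta)
    J = np.zeros((r0.size, theta.size))
    for i in range(theta.size):
        dt = np.zeros_like(theta)
        dt[i] = eps
        J[:, i] = (residuals(theta + dt) - r0) / eps
    return J

def lm_refine(residuals, theta, iters=400, lam=1e-3):
    """Damped Gauss-Newton (LM) loop: accept steps that reduce the total
    error, otherwise increase the damping factor lambda."""
    cost = np.sum(residuals(theta) ** 2)
    for _ in range(iters):
        J = numerical_jacobian(residuals, theta)
        r = residuals(theta)
        H = J.T @ J + lam * np.eye(theta.size)        # damped normal equations
        step = -np.linalg.solve(H, J.T @ r)           # update of Eq. (6)
        new_cost = np.sum(residuals(theta + step) ** 2)
        if new_cost < cost:                           # good step: relax damping
            theta, cost, lam = theta + step, new_cost, lam * 0.5
        else:                                         # bad step: damp harder
            lam *= 2.0
    return theta

# Toy usage: fit a 2D point so that its "projections" match two observations.
obs = np.array([1.0, 2.0, 1.1, 1.9])
residuals = lambda p: np.array([p[0] - obs[0], p[1] - obs[1],
                                p[0] - obs[2], p[1] - obs[3]])
print(lm_refine(residuals, np.zeros(2), iters=50))
```

For larger maps, an off-the-shelf solver such as `scipy.optimize.least_squares` (which also offers an LM strategy via `method='lm'`) with analytic Jacobians would typically replace the finite-difference version above.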
Localization Through Asynchronous Kalman Fusion
Building on the optimized global marker map, real-time camera localization within the global reference frame is achieved through the probabilistic fusion of asynchronous pose estimates from multiple ALMs. The proposed method rigorously accounts for transformation chain uncertainties via Lie group covariance propagation and employs an asynchronous Kalman filter to sequentially integrate multi-source observations.
State Representation and Observation Model
The camera pose in the global reference system constitutes the system state vector:
$$\mathbf{x}_t = \begin{bmatrix} \mathbf{p}_t \\ \mathbf{q}_t \end{bmatrix}, \tag{7}$$

where $\mathbf{p}_t \in \mathbb{R}^{3}$ denotes the positional coordinates and $\mathbf{q}_t$ is the orientation as a unit quaternion at time $t$. Observations from ALM $j$ adhere to the nonlinear measurement model:

$$\mathbf{z}_{j,t} = h_j(\mathbf{x}_t) + \mathbf{v}_{j,t}, \qquad \mathbf{v}_{j,t} \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_j), \tag{8}$$

where $h_j(\cdot)$ maps the global pose to the ALM-relative observation space, and $\mathbf{R}_j$ quantifies the measurement noise covariance.

Covariance Propagation
The estimated camera pose in the global reference system was transformed from that of the camera and marker reference systems. Precise localization was obtained by considering and optimizing the covariance model of the pose-transformation chain.
The global pose estimate derives from the transformation chain:
$$\mathbf{T}_{GC} = \mathbf{T}_{GM}\,\mathbf{T}_{MC}, \tag{9}$$

where $\mathbf{T}_{GC}$ represents the transformation from the camera reference system to the global reference system, and $\mathbf{T}_{GM}$ and $\mathbf{T}_{MC}$ represent the transformations from the ALM reference system to the global reference system and from the camera reference system to the ALM reference system, respectively. We define the perturbation term $\boldsymbol{\xi}$ to model the uncertainties in the transformation process as follows:

$$\mathbf{T} = \exp\!\left(\boldsymbol{\xi}^{\wedge}\right)\bar{\mathbf{T}}, \qquad \boldsymbol{\xi} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}), \tag{10}$$

where $\boldsymbol{\xi} \in \mathbb{R}^{6}$ represents the error in the transformation, modeled as a random variable following a Gaussian distribution with zero mean and covariance $\boldsymbol{\Sigma}$, and $\bar{\mathbf{T}}$ is the noise-free nominal transformation.
Algorithm 2
Localization through asynchronous Kalman fusion
Covariance propagation for the transformed pose was derived using the Lie algebraic perturbation model. The complete propagation equation is
$$\boldsymbol{\Sigma}_{GC} = \mathbf{J}_{GM}\,\boldsymbol{\Sigma}_{GM}\,\mathbf{J}_{GM}^{\top} + \mathbf{J}_{MC}\,\boldsymbol{\Sigma}_{MC}\,\mathbf{J}_{MC}^{\top}, \tag{11}$$

where $\boldsymbol{\Sigma}_{GC}$ is the propagated covariance in the global reference system, and $\boldsymbol{\Sigma}_{GM}$ and $\boldsymbol{\Sigma}_{MC}$ represent the covariances associated with the transformations $\mathbf{T}_{GM}$ and $\mathbf{T}_{MC}$, respectively. The Jacobian matrices $\mathbf{J}_{GM}$ and $\mathbf{J}_{MC}$ describe how the covariances of the intermediate transformations influence the propagated covariance. Appendix A presents a detailed mathematical proof of this covariance propagation. The adjoint operation in the Lie group is used. Specifically, the Jacobian matrices are expressed by

$$\mathbf{J}_{GM} = \mathbf{I}, \qquad \mathbf{J}_{MC} = \mathrm{Ad}\!\left(\mathbf{T}_{GM}\right), \tag{12}$$

where $\mathrm{Ad}(\mathbf{T}_{GM})$ denotes the adjoint matrix of the transformation $\mathbf{T}_{GM}$, and $\mathbf{I}$ is the identity matrix. The adjoint matrix is defined as:

$$\mathrm{Ad}(\mathbf{T}) = \begin{bmatrix} \mathbf{R} & \left[\mathbf{t}\right]_{\times}\mathbf{R} \\ \mathbf{0} & \mathbf{R} \end{bmatrix}, \tag{13}$$

where $\mathbf{R}$ is the rotation matrix, $\mathbf{t}$ is the translation vector, and $\left[\mathbf{t}\right]_{\times}$ is the skew-symmetric matrix of the translation vector $\mathbf{t}$. The adjoint matrix captures the effect of the transformation on the propagation of uncertainty in both rotation and translation. Thus, the Jacobian matrix $\mathbf{J}_{MC}$ is the adjoint of the transformation $\mathbf{T}_{GM}$, and $\mathbf{J}_{GM}$, being the identity matrix, indicates that the covariance associated with $\mathbf{T}_{GM}$ is directly propagated without any transformation.
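To make Eqs. (11)–(13) concrete, the sketch below builds the SE(3) adjoint from an assumed marker-to-global rotation and translation and propagates two 6 × 6 covariances through the chain; the block ordering (translation first, then rotation) and all numeric values are implementation assumptions rather than details taken from the paper.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x of a 3-vector."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def adjoint(R, t):
    """SE(3) adjoint of Eq. (13), with [translation; rotation] block ordering."""
    Ad = np.zeros((6, 6))
    Ad[:3, :3] = R
    Ad[:3, 3:] = skew(t) @ R
    Ad[3:, 3:] = R
    return Ad

# Assumed marker-to-global transform T_GM and chain covariances (illustrative).
R_gm = np.eye(3)
t_gm = np.array([1.0, 0.5, 0.0])
Sigma_gm = np.diag([1e-4] * 3 + [1e-3] * 3)   # uncertainty of T_GM (from mapping)
Sigma_mc = np.diag([2e-4] * 3 + [2e-3] * 3)   # uncertainty of T_MC (from PnP)

# Eqs. (11)-(12): J_GM = I, J_MC = Ad(T_GM).
Ad_gm = adjoint(R_gm, t_gm)
Sigma_gc = Sigma_gm + Ad_gm @ Sigma_mc @ Ad_gm.T
print(np.round(np.diag(Sigma_gc), 6))
```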
Asynchronous Kalman Fusion
A constant-velocity model for system dynamics was adopted to integrate the propagation model into a filtering framework as follows:
$$\mathbf{x}_{t} = f\!\left(\mathbf{x}_{t-1}\right) + \mathbf{w}_{t}, \qquad \mathbf{w}_{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_{t}), \tag{14}$$

where $\mathbf{x}_{t}$ represents the system state at time $t$, and $\mathbf{w}_{t}$ is the process noise modeled as a zero-mean Gaussian random variable with covariance $\mathbf{Q}_{t}$. The function $f(\cdot)$ describes the constant-velocity state transition and is expressed by:

$$f\!\left(\mathbf{x}_{t-1}\right) = \begin{bmatrix} \mathbf{p}_{t-1} + \hat{\mathbf{v}}_{t-1}\,\Delta t \\ \mathbf{q}\!\left(\hat{\boldsymbol{\omega}}_{t-1}\,\Delta t\right) \otimes \mathbf{q}_{t-1} \end{bmatrix}, \tag{15}$$

where $\hat{\mathbf{v}}_{t-1}$ and $\hat{\boldsymbol{\omega}}_{t-1}$ denote the linear and angular velocities estimated from the most recent state updates, and $\Delta t$ is the time elapsed since the previous update. When an observation from marker $j$ is received at time $t$, the filtering process begins by predicting the system state at the time of observation. This prediction is based on the system's dynamic model, where the state is propagated from the previous time step $t-1$ to the current time $t$. The predicted state is obtained as follows:

$$\hat{\mathbf{x}}_{t|t-1} = f\!\left(\hat{\mathbf{x}}_{t-1|t-1}\right). \tag{16}$$
The corresponding predicted state covariance is computed by propagating the previous covariance through the state transition matrix $\mathbf{F}_t$ and adding the process noise covariance $\mathbf{Q}_t$ as follows:

$$\mathbf{P}_{t|t-1} = \mathbf{F}_t\,\mathbf{P}_{t-1|t-1}\,\mathbf{F}_t^{\top} + \mathbf{Q}_t. \tag{17}$$
Once the prediction was obtained, the Jacobian observation matrix $\mathbf{H}_{j,t}$ was computed. This matrix represents the partial derivative of the observation function $h_j(\cdot)$, which describes the relationship between the predicted state and the actual measurements. The Jacobian matrix is evaluated at the predicted state $\hat{\mathbf{x}}_{t|t-1}$:

$$\mathbf{H}_{j,t} = \left.\frac{\partial h_j}{\partial \mathbf{x}}\right|_{\mathbf{x} = \hat{\mathbf{x}}_{t|t-1}}. \tag{18}$$
A detailed derivation of the Jacobian matrix is provided in Appendix A. Subsequently, the Kalman gain $\mathbf{K}_t$ was calculated. The Kalman gain determines the weight assigned to the innovation (the difference between the observed and predicted values) when updating the state estimate. The Kalman gain was computed as follows:

$$\mathbf{K}_t = \mathbf{P}_{t|t-1}\,\mathbf{H}_{j,t}^{\top}\left(\mathbf{H}_{j,t}\,\mathbf{P}_{t|t-1}\,\mathbf{H}_{j,t}^{\top} + \mathbf{R}_j\right)^{-1}. \tag{19}$$
Finally, the state estimate was updated by incorporating the innovation scaled by the Kalman gain. This adjustment modifies the predicted state using the new measurement $\mathbf{z}_{j,t}$, resulting in the updated state $\hat{\mathbf{x}}_{t|t}$. In addition, the updated state covariance $\mathbf{P}_{t|t}$ was computed to reflect the reduced uncertainty after the update:

$$\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t\left(\mathbf{z}_{j,t} - h_j\!\left(\hat{\mathbf{x}}_{t|t-1}\right)\right), \tag{20}$$

$$\mathbf{P}_{t|t} = \left(\mathbf{I} - \mathbf{K}_t\,\mathbf{H}_{j,t}\right)\mathbf{P}_{t|t-1}. \tag{21}$$
This completed the asynchronous update, in which the new measurement was incorporated into the state estimate and the uncertainty (covariance) was adjusted accordingly.
For concurrent observations from $N$ ALMs at the same time $t$, the optimal fusion results are as follows:

$$\mathbf{P}_{t|t} = \left(\sum_{j=1}^{N}\boldsymbol{\Sigma}_{j,t}^{-1}\right)^{-1}, \tag{22}$$

$$\hat{\mathbf{x}}_{t|t} = \mathbf{P}_{t|t}\sum_{j=1}^{N}\boldsymbol{\Sigma}_{j,t}^{-1}\,\hat{\mathbf{x}}_{j,t}, \tag{23}$$

where $\hat{\mathbf{x}}_{j,t}$ and $\boldsymbol{\Sigma}_{j,t}$ denote the state estimate and covariance obtained from ALM $j$.
This strategy allows the fusion of multiple asynchronous measurements by weighting them according to their respective uncertainties, thereby ensuring an optimal estimate of the state, as shown in Algorithm 2.
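Algorithm 2 and the Jacobians of Appendix A are not reproduced here; the following is only a schematic of the asynchronous predict-update flow, assuming a simplified state of 3D position and velocity and a linear position observation per ALM, with a covariance such as the one propagated via Eq. (11) supplied as the measurement noise.

```python
import numpy as np

class AsyncKalman:
    """Constant-velocity filter that absorbs pose observations from different
    ALMs whenever they arrive, each with its own timestamp and covariance."""

    def __init__(self):
        self.x = np.zeros(6)                 # [position, velocity]
        self.P = np.eye(6)
        self.t = 0.0
        self.q = 1e-3                        # process-noise scale (assumed)

    def _predict(self, t_new):
        dt = t_new - self.t
        F = np.eye(6)
        F[:3, 3:] = dt * np.eye(3)           # constant-velocity transition, cf. Eq. (14)
        Q = self.q * dt * np.eye(6)
        self.x = F @ self.x                  # predicted state, Eq. (16)
        self.P = F @ self.P @ F.T + Q        # predicted covariance, Eq. (17)
        self.t = t_new

    def update(self, t_obs, z_pos, R):
        """One asynchronous measurement: camera position derived from a single
        ALM, with its propagated covariance R (3x3)."""
        self._predict(t_obs)
        H = np.hstack([np.eye(3), np.zeros((3, 3))])   # linear stand-in for Eq. (18)
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)            # Kalman gain, Eq. (19)
        self.x = self.x + K @ (z_pos - H @ self.x)     # state update, Eq. (20)
        self.P = (np.eye(6) - K @ H) @ self.P          # covariance update, Eq. (21)
        return self.x[:3]

kf = AsyncKalman()
# Two ALM-derived observations arriving 200 microseconds apart (illustrative).
print(kf.update(0.0001, np.array([1.00, 0.50, 0.20]), 1e-4 * np.eye(3)))
print(kf.update(0.0003, np.array([1.01, 0.49, 0.21]), 2e-4 * np.eye(3)))
```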
Experiment
The proposed method was evaluated in both a handheld motion environment and spiral flight. The event camera (Prophesee EVK4; 1280 × 720 pixels, IMX636ES) was manually moved in a low-texture environment, and the localization results obtained from our method were compared with those of the mono-event-based localization method, EVO [20].
In addition, experiments were conducted using a self-designed UAV equipped with a forward-looking event camera (Prophesee EVK4; 1280 × 720 pixels, IMX636ES) and an Intel NUC14MNK15 computer running Ubuntu 22.04. The UAV followed a spiral flight path with a radius of 1.2 m and a varying altitude ranging from 0.2 m to 1.2 m. The event camera continuously captured ALMs in the central region of the flight path, which had previously been mapped, as illustrated in Figure 6. The commercial infrared 3D tracking system OptiTrack was used as the ground truth for comparative analysis.
Figure 6
Visualization of the experimental setup for the spiral flight and ALMs. The UAV flies in a spiral path with a radius of 1.2 m and a height varying from 0.2 m to 1.2 m, carrying the event camera that captures the central region, where a total of 10 ALMs are deployed on a triangular prism
Evaluation
Global Mapping Accuracy Through Reprojection Error Optimization
High-precision global mapping was achieved based on reprojection error optimization of the proposed method. The evolution of the reprojection error in the proposed method is shown in Figure 7. As shown, the reprojection error initially decreased rapidly before gradually leveling off. At the start of the optimization, the absolute corner error (ACE) was 16.8 cm. After 400 iterations, the ACE improved to 3.1 cm, demonstrating that the global mapping was highly accurate, thus facilitating subsequent high-precision fused localization.
Figure 7
Evolution of the reprojection error during our optimization process. The absolute corner error (ACE) is shown at selected iterations. The reprojection error initially decreases rapidly before gradually leveling off
The large number of estimated camera poses within a short period is effectively utilized because of the high temporal efficiency of our ALM detection approach, accelerating the convergence of the reprojection error optimization iterations. Consequently, compared with frame-based algorithms, our method requires less computational time while maintaining superior accuracy.
Localization Accuracy in Dynamic Less-Featured Scene
The localization performance of EVO [20] and our method in a low-feature, dynamic environment was compared to demonstrate the enhanced performance of our approach, with the ground truth provided by OptiTrack. EVO [20] achieves highly accurate localization, unaffected by motion blur and high-dynamic-range conditions, using a purely geometric method based on edge patterns that depends only on data from a single event camera, as does our proposed method. The captured view was a black plane with six ALMs, where only the edges of the ALMs and the plane provided a few features. The dynamic motion was strong hand shaking at approximately 3 m/s and 10 rad/s. As shown in Figure 8, the translation and rotation errors over time were quantified and visualized, illustrating the comparative results of EVO [20] and our method. Additionally, the box plots of translation and rotation errors over 10-s intervals, shown in Figure 9, present a statistical comparison.
Figure 8
Visualization of the (a) translation error and (b) rotation error of EVO [20], our method, and the ground truth. Our method shows better estimation at all tested times compared with EVO, especially during strong motion
Figure 9
Pose estimation error comparison of our method and EVO [20]. Our method outperforms EVO over the entire tested period, with lower (a) translation error and (b) rotation error.
Our method outperforms EVO [20] in terms of both accuracy and robustness, as shown in Table 1, owing to its more precise and robust feature detection and tracking, which benefits from the decoded ALM LEDs and the globally optimized map. EVO was hindered in the low-feature, dynamic environment, where it struggled to maintain accuracy. The translation and rotation errors of EVO increased significantly as the handheld motion became stronger. This was primarily because of the insufficient number of features detected in such environments, which impaired its ability to estimate the camera pose accurately. In contrast, the ALM LEDs in our method were detected and tracked with high precision and robustness, ensuring accurate pose estimation even under dynamic conditions.
Table 1. Statistical analysis of accuracy evaluation reveals that our method significantly outperforms EVO [20] both in terms of accuracy and robustness
| Method | Trans. mean (m) | Trans. RMSE (m) | Trans. std. (m) | Rot. mean (°) | Rot. RMSE (°) | Rot. std. (°) |
|---|---|---|---|---|---|---|
| EVO | 0.323 | 0.347 | 0.128 | 25.834 | 28.922 | 13.002 |
| Our method | **0.078** | **0.084** | **0.032** | **5.411** | **6.590** | **3.760** |
Bold values indicate that our method outperforms EVO in both translation and rotation error
In summary, our method demonstrated superior accuracy and robustness under all tested time and motion conditions relative to the ground truth provided by OptiTrack, as shown in Figures 8 and 9. As shown in Table 1, the mean translation error of our method was 0.078 m and the mean rotation error was 5.411°, whereas EVO [20] showed errors of 0.323 m and 25.834°, respectively. These results highlight the advantages of the proposed method. Thus, our method is well suited for deployment in low-feature environments, providing a superior alternative to existing algorithms for robotics applications requiring precise localization, such as UAV formation control during aggressive motion in sparse-feature environments.
Table 2. Statistical analysis of key performance metrics influenced by varying ALM LED intensity
| ALM LED intensity (lux) | Max. detection distance (m) | Trans. mean (m) | Trans. RMSE (m) | Trans. std. (m) | Rot. mean (°) | Rot. RMSE (°) | Rot. std. (°) |
|---|---|---|---|---|---|---|---|
| 827 | 1 | 0.351 | 0.383 | 0.154 | 24.35 | 30.309 | 18.048 |
| 1536 | 1.7 | 0.218 | 0.238 | 0.096 | 15.151 | 18.889 | 11.28 |
| 3078 | 2.3 | 0.148 | 0.161 | 0.064 | 10.281 | 12.738 | 7.52 |
| 6223 | 3 | 0.101 | 0.11 | 0.043 | 7.034 | 8.674 | 5.076 |
| 10270 | 3.8 | 0.078 | 0.084 | 0.032 | 5.411 | 6.59 | 3.76 |
Moreover, we evaluated the impact of varying the ALM LED intensity on key performance metrics, including the maximum detection distance and the accuracy of 6-DOF localization (translation and rotation errors), as shown in Table 2. A higher intensity significantly improves the detection distance and localization accuracy. For instance, increasing the ALM LED intensity from 827 lux to 10270 lux extended the maximum detection distance from 1 m to 3.8 m and markedly reduced the mean translation error from 0.351 m to 0.078 m and the mean rotation error from 24.35° to 5.411°. Conversely, a very low ALM intensity drastically impaired performance owing to challenges in robust marker detection.
This indicates that an adequate ALM LED intensity is crucial for optimal system performance, directly enabling a greater detection range and substantially enhancing the 6-DOF localization accuracy and robustness.
Localization Accuracy in Spiral UAV Flight
A comparison between our estimation and the ground truth provided by OptiTrack is shown in Figure 10. As observed, the estimated translation and rotation of the high-speed UAV during the spiral flight were highly accurate compared with the ground truth. The minimum estimation error was 0.00054 m when the motion was relatively slow. The mean error was 0.0866 m, demonstrating the high accuracy of the proposed method over the entire flight. The ALMs were detected across the entire flight path, allowing continuous estimation throughout the flight duration. The root-mean-square error was 0.097 m, demonstrating the robustness of our method for high-speed flight. Even during aggressive maneuvers involving both position and orientation, our method maintained precise and robust estimation, as shown in Figure 11. The full spiral flight process and the 6-DOF pose estimation are shown in Video 1.
Figure 10
Comparison of the (a) position and (b) orientation estimated by our method in the spiral flight with the ground truth from OptiTrack. Our method shows superior performance even during aggressive motion, in both position and orientation.
Figure 11
Box plot of the estimation error of our method in each 10 s period, showing the highly accurate estimation of our method over the whole tested time
The temporal performance of the proposed method was evaluated by recording the time required for each estimate. Because of the asynchronous Kalman fusion approach, which processes multiple camera pose estimates from different ALMs in µs, our method demonstrated significant advantages in terms of temporal efficiency. As shown in Table 3, the maximum output rate reached 4.5 kHz, with an average output rate of 3.16 kHz, which was notably higher than that of current algorithms. These results highlight the potential of our method for real-time localization in multi-robot tasks.
Table 3. Statistics of the output rate for localization estimation
| | Maximum | Minimum | Mean | Median |
|---|---|---|---|---|
| Our method | 4.5 kHz | 1.53 kHz | 3.16 kHz | 3.15 kHz |
Conclusions
We present a bio-inspired vision mapping and localization method that leverages ALMs, reprojection error optimization, and asynchronous Kalman fusion to address the challenges in sparse-feature environments. Our approach demonstrates significant improvements in both the mapping accuracy and localization performance under dynamic conditions. The conclusions of this study are as follows:
Enhanced feature robustness in sparse environments: The proposed method successfully overcomes the limitations of traditional feature-based approaches under sparse or low-feature conditions by utilizing ALMs. These active markers provide robust and reliable features for camera pose estimation, enabling consistent performance when natural features are insufficient or unstable.
High-accuracy global mapping through optimization: The proposed method achieved high-precision global mapping by employing reprojection error minimization. Specifically, the ACE of the generated map was reduced from an initial 16.8 cm to 3.1 cm after 400 optimization iterations, showcasing the effectiveness of the mapping refinement process.
Superior localization performance: Compared with the mono-event-based EVO method [20], our approach exhibited significantly better localization accuracy. In dynamic low-feature environments, our method achieved a mean translation error of 0.078 m (EVO: 0.323 m) and a mean rotational error of 5.411° (EVO: 25.834°).
High temporal efficiency and real-time capability: The asynchronous Kalman fusion component enables the µs-level fusion of multiple ALM pose estimations. This high temporal efficiency, combined with rapid ALM detection, allowed for a maximum localization output rate of 4.5 kHz during the UAV spiral flight experiments while maintaining high accuracy.
These results collectively demonstrate the potential of the proposed method for robust and precise robot mapping and localization in challenging real-world scenarios, including those with difficult illumination and aggressive motion, making it suitable for applications such as multi-robot formation control.
Acknowledgements
Not applicable.
Authors' Contributions
Shijie Zhang was responsible for the conceptualization and methodology, and wrote the manuscript; Tao Tang, Taogang Hou, and Tianmiao Wang provided supervision and project administration; Yuxuan Huang developed the software and performed validation; Xuan Pei contributed to visualization and investigation. All authors read and approved the final manuscript.
Funding
Supported by Beijing Natural Science Foundation (Grant No. L231004), Young Elite Scientists Sponsorship Program by CAST (Grant No. 2022QNRC001), Fundamental Research Funds for the Central Universities (Grant No. 2025JBMC039), National Key Research and Development Program (Grant No. 2022YFC2805200), National Natural Science Foundation of China (Grant No. 52371338).
Data availability
All data will be made available on reasonable request.
Declarations
Competing Interests
The authors declare no competing financial interests.
References
[1] Cao, ZC; Yin, LJ; Fu, YL. Vision-based stabilization of nonholonomic mobile robots by integrating sliding-mode control and adaptive approach. Chinese Journal of Mechanical Engineering; 2013; 26,
[2] Gallego, G; Delbrück, T; Orchard, G et al. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence; 2020; 44,
[3] Gehrig, M; Millhauser, A; Scaramuzza, D. EKLT: Asynchronous, photometric feature tracking using events and frames. International Journal of Computer Vision; 2020; 128,
[4] V Vasco, A Glover, C Bartolozzi. Fast event-based Harris corner detection exploiting the advantages of event-driven cameras. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016: 4144–4149.
[5] Harris, C; Stephens, M. A combined corner and edge detector. Alvey Vision Conference; 1988; 15,
[6] Scheerlinck, C; Barnes, N; Mahony, R. Asynchronous spatial image convolutions for event cameras. IEEE Robotics and Automation Letters; 2019; 4,
[7] Alzugaray, I; Chli, M. Asynchronous corner detection and tracking for event cameras in real time. IEEE Robotics and Automation Letters; 2018; 3,
[8] P Chiberre, E Perot, A Sironi, et al. Detecting stable keypoints from events through image gradient prediction. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, USA, 2021: 1387–1394
[9] X J Shi, Z R Chen, H Wang, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 2015, 28.
[10] Von Gioi, RG; Jakubowicz, J; Morel, JM et al. LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis and Machine Intelligence; 2008; 32,
[11] Valeiras, DR; Clady, X; Ieng, SH et al. Event-based line fitting and segment detection using a neuromorphic visual sensor. IEEE Transactions on Neural Networks and Learning Systems; 2018; 30,
[12] Vidal, AR; Rebecq, H; Horstschaefer, T et al. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robotics and Automation Letters; 2018; 3,
[13] Chamorro, W; Sola, J; Andrade-Cetto, J. Event-based line slam in real-time. IEEE Robotics and Automation Letters; 2022; 7,
[14] Guan, WP; Chen, PY; Xie, YH et al. Pl-evio: Robust monocular event-based visual inertial odometry with point and line features. IEEE Transactions on Automation Science and Engineering; 2023; 21,
[15] Wang, C; Zhang, JH; Zhao, Y et al. Human visual attention mechanism-inspired point-and-line stereo visual odometry for environments with uneven distributed features. Chinese Journal of Mechanical Engineering; 2023; 36, 62.
[16] A Censi, J Strubel, C Brandli, et al. Low-latency localization by active LED markers tracking using a dynamic vision sensor. IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, November 3-7, 2013: 891–898.
[17] Chen, G; Chen, WK; Yang, QY et al. A novel visible light positioning system with event-based neuromorphic vision sensor. IEEE Sensors Journal; 2020; 20,
[18] G R Müller, J Conradt. A miniature low-power sensor system for real time 2D visual tracking of LED markers. IEEE International Conference on Robotics and Biomimetics, Phuket, Thailand, December 7–11, 2011: 2429–2434.
[19] G Ebmer, A Loch, M N Vu, et al. Real-time 6-DoF pose estimation by an event-based camera using active LED markers. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Hawaii, USA, January 4–8, 2024: 8137–8146.
[20] Rebecq, H; Horstschäfer, T; Gallego, G et al. Evo: A geometric approach to event-based 6-dof parallel tracking and mapping in real time. IEEE Robotics and Automation Letters; 2016; 2,
[21] Tenzin, S; Rassau, A; Chai, D. Application of event cameras and neuromorphic computing to VSLAM: A survey. Biomimetics; 2024; 9,
[22] E Mueggler, C Bartolozzi, D Scaramuzza. Fast event-based corner detection. Proceedings of the British Machine Vision Conference (BMVC), London, UK, September 4–7, 2017.
[23] E Rosten, T Drummond. Machine learning for high-speed corner detection. Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7–13, 2006: 430–443.
[24] A Zihao Zhu, N Atanasov, K Daniilidis. Event-based visual inertial odometry. IEEE Conference on Computer Vision and Pattern Recognition, USA, June 21–26, 2017: 5391–5399.
[25] A Zihao Zhu, N Atanasov, K Daniilidis. Event-based feature tracking with probabilistic data association. IEEE International Conference on Robotics and Automation (ICRA), Singapore, May 29–June 3, 2017: 4465–4470.
[26] H Rebecq, T Horstschaefer, D Scaramuzza. Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization Proceedings of the British Machine Vision Conference (BMVC), London, UK, September 4–7, 2017.
[27] J Manderscheid, A Sironi, N Bourdis, et al. Speed invariant time surface for learning to detect corner points with event-based cameras. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, October 27–November 2, 2019: 10245–10254.
[28] D R Da Costa, M Robic, P Vasseur, et al. A new stereo fisheye event camera for fast drone detection and tracking. IEEE International Conference on Robotics and Automation, Vienna, Austria, June 1–6, 2025.
[29] Wang, JN; Gammell, JD. Event-based stereo visual odometry with native temporal resolution via continuous-time Gaussian process regression. IEEE Robotics and Automation Letters; 2023; 8,
[30] Chen, PY; Guan, WP; Lu, P. Esvio: Event-based stereo visual inertial odometry. IEEE Robotics and Automation Letters; 2023; 8,
[31] J K Niu, S Zhong, Y Zhou. Imu-aided event-based stereo visual odometry IEEE International Conference on Robotics and Automation (ICRA), Japan, May 13–17, 2024: 11977–11983.
[32] Niu, JK; Zhong, S; Lu, XY et al. Esvo2: Direct visual-inertial odometry with stereo event cameras. IEEE Transactions on Robotics; 2025; 41, pp. 2164-2183.
[33] Everding, L; Conradt, J. Low-latency line tracking using event-based dynamic vision sensors. Frontiers in Neurorobotics; 2018; 12, 4.
[34] W Z Yuan, S Ramalingam. Fast localization and tracking using event sensors. IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, May 16–21, 2016: 4564–4571.
[35] E Mueggler, B Huber, D Scaramuzza. Event-based, 6-DOF pose tracking for high-speed maneuvers. IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, USA, September 14–18, 2014: 2761–2768.
[36] J Bertrand, A Yiğit, S Durand. Embedded event-based visual odometry. 2020 6th International Conference on Event-based Control, Communication, and Signal Processing (EBCCSP), Krakow, Poland, September 23–25, 2020: 1–8.
[37] C Brändli, J Strubel, S Keller, et al. ELiSeD—an event-based line segment detector. 2016 Second International Conference on Event-based Control, Communication, and Signal Processing (EBCCSP), Krakow, Poland, June 13–15, 2016: 1–7.
[38] H W Ren, J D Zhu, Y Zhou, et al. A simple and effective point-based network for event camera 6-dofs pose relocalization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, June 19–21, 2024: 18112–18121.
[39] S Ghosh, V Cavinato, G Gallego. ES-PTAM: Event-based stereo parallel tracking and mapping. European Conference on Computer Vision, Milano, Italy, September 29–October 4, 2024: 70–87.
[40] Shiba, S; Klose, Y; Aoki, Y et al. Secrets of event-based optical flow, depth and ego-motion estimation by contrast maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence; 2024; 46,
[41] A El Moudni, F Morbidi, S Kramm, et al. An event-based stereo 3D mapping and tracking pipeline for autonomous vehicles. IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, September 24–28, 2023: 5962–5968.
[42] X Wang, S A Wang, C M Tang, et al. Event stream-based visual object tracking: A high-resolution benchmark dataset and a novel baseline. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, June 19–21, 2024: 19248–19257.
[43] J H Yuan, H B Li, Y S Peng, et al. Event-based head pose estimation: Benchmark and method. European Conference on Computer Vision, Milano, Italy, September 29–October 4, 2024: 191–208.
[44] S J Zhang, Y X Huang, X Pei, et al. An improved LED aruco-marker detection method for event camera. International Congress on Communications, Networking, and Information Systems, Guilin, China, March 25–27, 2023: 47–57.
[45] Y X Huang, T G Hou, S J Zhang, et al. A Novel ArUco marker for event cameras. 2023 International Conference on Frontiers of Robotics and Software Engineering (FRSE), Zhangjiajie, China, August 8–10, 2023: 271–277.
[46] B Triggs, P F McLauchlan, R I Hartley, et al. Bundle adjustment—a modern synthesis. International Workshop on Vision Algorithms, Berlin, Heidelberg: Springer Berlin Heidelberg, 1999: 298–372.
[47] Marquardt, DW. An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal on Applied Mathematics; 1963; 11,
© The Author(s) 2025. This article is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).