INTRODUCTION
SLAM is the core technology that enables robots to understand their external environment and to localise themselves within it. From the sensing perspective, the SLAM problem can be solved with visual sensors [1], radar sensors [2], and multi-sensor fusion [3]. However, natural working environments contain many moving objects, such as people and other mobile robots. Object motion greatly reduces the robustness of feature-based visual localisation and causes most existing SLAM systems to fail. Moreover, these moving obstacles leave distorted shapes in the reconstruction and should be removed or avoided. An early and straightforward response to this dynamic-environment issue is to extend existing static SLAM algorithms by detecting and removing the dynamic objects [4–6]. Some previous works [7–9] made efforts to recognise humanoid motion in such changing environments. More recently, another line of work applies motion segmentation methods, for example, the optical-flow-based FlowFusion [10] and the voxel-segmentation-based StaticFusion [11]. These motion segmentation approaches decouple dynamic pixels from static background pixels by checking the consistency of camera-motion clusters and removing the dynamic pixels. Whether based on motion segmentation or object recognition, both types of framework spend significant computational resources on discovering and removing the dynamic components of the visual input, and they typically rely on dedicated GPUs to handle dynamic obstacles in real time. This limits their use in online mobile robotic systems where computational resources are scarce.
Audio human-robot interaction (HRI) requires the detection of different environmental sound sources. This function usually relies on a microphone array to localise, track, and disambiguate various sound sources in real time. In the field of direction of arrival (DOA) research, classical approaches are primarily based on estimating the time difference of arrival (TDOA) between microphones to localise and track speakers in real environments. Knapp et al. in ref. [12] present a classical TDOA estimation method based on generalised cross-correlation. Chen et al. in ref. [13] proposed a TDOA estimation framework for single-speaker localisation. For the case of multiple speakers, Ishi et al. [14] provided a well-known multiple signal classification approach. More recently, Li et al. proposed a direct-path relative transfer function (DP-RTF) method combined with exponential gradients to locate and track multiple moving loudspeakers [15]. The method is robust to reverberation, which is particularly important for indoor SLAM applications [16]. It is worth noting that sound source localisation is usually limited by the structural design of the microphone array and only estimates the direction of the sound source, without range/depth estimation.
In this paper, we combine the strengths of visual SLAM and DOA approaches: we use DOA to detect the direction of noisy mobile robots, mark and process the corresponding dynamic-obstacle regions in the RGB-D point clouds through heterogeneous information fusion, and ultimately solve the dynamic SLAM problem in mobile-robot collaboration scenarios. Vision-audio fusion has several benefits for mobile robots (see Figure 1): (1) Reduced onboard power and computational cost; the DOA method of ref. [15], for example, uses the onboard CPU for efficient online multi-source azimuth detection. (2) An efficient and robust approach to fusing DOA with visual SLAM. (3) The microphone array can also provide valuable voice-interaction functions.
[IMAGE OMITTED. SEE PDF]
In summary, the contributions of this work are:
- A novel dynamic-environment multi-AGV-robot cooperation solution based on sound source detection and sparse-feature visual odometry.
- An efficient and robust vision-audio fusion approach for dynamic SLAM applications.
- A visual odometry database with simultaneous sound and RGB-D images.
RELATED WORKS
Mobile robot visual perception
The main tasks of environment awareness for mobile robots are environment recognition and robot self-positioning, both of which can be performed within the SLAM framework. In multi-agent mobile collaborative systems, most existing solutions attempt to deal with dynamic environments by recognising and removing dynamic objects. Based on their object recognition methods, we categorise the current state-of-the-art approaches into those based on motion segmentation and those based on object detection.
Object recognition based solutions are more intuitive. Thanks to the rapid development of deep learning-based object recognition and image segmentation models, one can quickly identify pre-defined movable objects, including people, motor vehicles, small animals, and even robots whose models are known, remove the pixels belonging to these objects, and then let classical visual SLAM methods reconstruct the environment despite its dynamics. A typical system is DynaSLAM [4] proposed by Bescos et al., which spends about 300 ms per frame applying Mask R-CNN to detect human targets in RGB images before employing ORB-SLAM2 (ORB2) [17] for camera self-localisation and background reconstruction. Z. Li et al. recently proposed a new deep collaborative embedding model using a context-based tandem network for semantic segmentation in ref. [18]. Long processing time is the main drawback of most deep learning based object detection methods.
Methods based on motion segmentation do not directly target the underlying semantics of object shapes; instead, they look for the properties of moving objects in sensing signals such as image pixels, laser point clouds, sonar, or audio signals in order to distinguish the dynamic components of the environment. Typical systems include StaticFusion (SF) [11] by Scona et al., FlowFusion (FF) [10] by Zhang et al., and the multi-object motion segmentation approach [19] by Judd et al. SF combines scene-flow computation with visual odometry to enable real-time static background reconstruction of small-sized rooms. FF utilises optical-flow residuals for dynamic object segmentation and removes dynamic point clouds for dense background reconstruction. The method of [19] applies sparse feature-point alignment to separate and track multiple rigid objects. In addition, Dai et al. propose to use motion coherence to distinguish between dynamic and static visual feature points [20]. All of these solutions require the computing power provided by a GPU, and there are currently no examples of their deployment on multi-robot mobile systems.
Vision-audio fusion methods
Audiovisual integration has been studied for many years in several research fields, such as human-computer interaction [21] and video captioning [22]. Li et al. developed an HRI system based on sound source localisation in ref. [21], calibrating the pixel coordinates corresponding to the sound source. In their robotic experiments, a NAO robot head with four microphones performed robust orientation localisation under difficult acoustic conditions. In the context of multi-source localisation, Hospedales et al. proposed a Bayesian-model-based framework for audiovisual fusion [23] that can segment, associate, and track multiple objects in an audiovisual sequence. In addition, Ban et al. proposed an audiovisual fusion method [24] for tracking the displacements of multiple speakers, fusing transfer-function features with a Bayesian face-observation model. In subsequent work [25], they updated this multi-speaker tracking module and achieved localisation of the sound source within the bounding box of the speaker's head pixels. However, this work used CNN-based human target detection, which required 500 ms per image (GTX 1070 GPU).
The robots in the above work remain stationary while observing human targets. For mobile robots, Evers et al. propose in ref. [26] an acoustic SLAM scheme that extends the concept of robotic SLAM: acoustic source localisation techniques are applied to passively localise and track a moving acoustic target while mapping the locations of surrounding sources. In contrast to conventional SLAM, however, it cannot be used for robot self-localisation and environment perception. In addition, the work in ref. [27] proposed using multiple mobile microphone arrays cooperatively for DOA estimation. To address the limitation that DOA can only determine the direction of a sound source and cannot estimate its depth, that work estimates the source location from the intersection of the extended sound-source direction lines.
METHOD
We propose a multi-mobile-robot visual SLAM system that uses DOA estimation to detect noise-emitting dynamic objects. The flowchart is shown in Figure 2: we use the Azure Kinect sensor to acquire synchronised audio, colour, and depth signals. After feature extraction and clustering, we obtain the source azimuths of the sound signals received by the seven-microphone array. We then project the sound-source azimuth onto the horizontal optical axis of the camera and fuse the DOA into the image space. In the subsequent SLAM task, we extract visual features (yellow dots) from the point cloud outside the detected sound-source region for camera motion estimation and environment reconstruction, thus removing the influence of the sound-emitting objects on the system.
[IMAGE OMITTED. SEE PDF]
Online direction of arrival estimation
Problem formulation
In the time domain, the multi-microphone signals are (with the index of discrete time omitted): $x_m = a_m \ast s$, where $\ast$ denotes convolution and $m = 1, \dots, M$ denotes the index of the microphone. The m-th microphone signal $x_m$ is the convolution between the source signal $s$ and the room impulse response (RIR) $a_m$. In the short-time Fourier transform (STFT) domain, with the convolutive transfer function (CTF) approximation,
$$x_{m,t,f} = a_{m,t,f} \ast s_{t,f} = \sum_{p \ge 0} a_{m,p,f}\, s_{t-p,f},$$
where t and f denote the indices of time-frame and frequency, respectively, and the convolution runs along the frame index. In this model, the STFT coefficients of the microphone signal are the temporal convolution between the STFT coefficients of the source signal $s_{t,f}$ and the CTF $a_{m,t,f}$. The CTF is a down-sampled representation of the RIR at one STFT frequency. The direct-path propagation is present in the first RIR taps, and correspondingly in the first CTF coefficient $a_{m,0,f}$.
DOA estimation amounts to estimating the direct-path propagation from the sound source to the microphone array. To estimate the direct-path propagation, the entire CTFs are first blindly identified from the microphone signals based on the cross-relation method [28], and then only the first CTF coefficients of the multiple microphones are employed as the direct-path feature. For notational convenience, the frequency index f will hereafter be omitted, as each frequency is processed independently in this section.
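To make the CTF model concrete, the following Python sketch (our illustration, not the authors' code) synthesises multi-channel microphone STFT coefficients from a given source STFT and per-channel CTFs by convolving along the frame axis; the array shapes and the function name are assumptions.

```python
import numpy as np

def ctf_synthesize(s_tf, ctf):
    """Synthesise microphone STFT coefficients under the CTF approximation.

    s_tf : (T, F) complex source STFT s[t, f]
    ctf  : (M, P, F) complex CTFs a[m, p, f], with P the CTF length
    Returns x of shape (M, T, F) with x[m, t, f] = sum_p a[m, p, f] * s[t - p, f].
    """
    M, P, F = ctf.shape
    T = s_tf.shape[0]
    x = np.zeros((M, T, F), dtype=complex)
    for m in range(M):
        for p in range(P):
            # add the p-frames-delayed source, weighted by the p-th CTF tap
            x[m, p:, :] += ctf[m, p, :] * s_tf[:T - p, :]
    return x
```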
Direct-path feature estimation
Consider a pair of microphones (m, n). The cross-relation
$$x_{m,t} \ast a_{n,t} = x_{n,t} \ast a_{m,t}$$
always holds, since both sides equal $a_{m,t} \ast a_{n,t} \ast s_t$. Let P denote the CTF length; the CTF and the microphone signal can be denoted in vector form as $\mathbf{a}_m = [a_{m,0}, \dots, a_{m,P-1}]^{\top}$ and $\mathbf{x}_{m,t} = [x_{m,t}, x_{m,t-1}, \dots, x_{m,t-P+1}]^{\top}$, where $\top$ denotes matrix/vector transpose. Then the cross-relation can be written in vector form as
$$\mathbf{x}_{m,t}^{\top}\,\mathbf{a}_n = \mathbf{x}_{n,t}^{\top}\,\mathbf{a}_m.$$
We concatenate the CTF vectors of all (microphone) channels as $\mathbf{a} = [\mathbf{a}_1^{\top}, \dots, \mathbf{a}_M^{\top}]^{\top}$. Accordingly, the microphone signal vectors are concatenated as
$$\tilde{\mathbf{x}}_t^{mn} = [\mathbf{0}^{\top}, \dots, \mathbf{x}_{n,t}^{\top}, \dots, -\mathbf{x}_{m,t}^{\top}, \dots, \mathbf{0}^{\top}]^{\top},$$
with $\mathbf{x}_{n,t}$ and $-\mathbf{x}_{m,t}$ placed in the m-th and n-th blocks, respectively, so that the cross-relation becomes $\tilde{\mathbf{x}}_t^{mn\top}\,\mathbf{a} = 0$. To avoid the trivial all-zero solution, the problem is constrained with respect to the first CTF coefficient of the reference microphone, $a_{1,0}$. To solve this constrained problem, we divide $\mathbf{a}$ by $a_{1,0}$; the new unknown vector $\bar{\mathbf{a}} = \mathbf{a}/a_{1,0}$, whose entry corresponding to $a_{1,0}$ equals 1, automatically satisfies the constraint. Then, reorganising the equation by moving the known unit entry to the right-hand side yields
$$\tilde{\mathbf{y}}_t^{mn\top}\,\mathbf{g} = -\tilde{x}_{t,1}^{mn}, \qquad (4)$$
where $\tilde{x}_{t,1}^{mn}$ denotes the entry of $\tilde{\mathbf{x}}_t^{mn}$ corresponding to $a_{1,0}$, $\tilde{\mathbf{y}}_t^{mn}$ collects its remaining entries, and $\mathbf{g}$ denotes $\bar{\mathbf{a}}$ with the unit entry removed.
We solve the linear problem Equation (4) to estimate the DP-RTFs for DOA estimation. In the online manner, the microphone signals $\tilde{x}_{t,1}^{mn}$ and $\tilde{\mathbf{y}}_t^{mn}$ are received frame-wise, and accordingly the online estimation of $\mathbf{g}$ is also conducted frame by frame. More specifically, based on the estimate of $\mathbf{g}$ at the previous frame, it is recursively updated using the signals of the current frame. At one frame, Equation (4) is defined for one microphone pair, and all K = M(M − 1)/2 distinct microphone pairs are utilised. For notational convenience, instead of mn, we use k = 1, …, K to denote the index of the microphone pair. The matching error of Equation (4) is defined as $e_t^k = \tilde{\mathbf{y}}_t^{k\top}\mathbf{g} + \tilde{x}_{t,1}^{k}$. Utilising the recursive least squares (RLS) method, the online cost function up to frame t and microphone pair k is defined as
$$J_t^k(\mathbf{g}) = \sum_{t'=1}^{t-1}\lambda^{\,t-t'}\sum_{k'=1}^{K}\left|e_{t'}^{k'}\right|^2 + \sum_{k'=1}^{k}\left|e_{t}^{k'}\right|^2,$$
where $0 < \lambda < 1$ is the forgetting factor.
The estimate of $\mathbf{g}$ is updated frame by frame and is denoted $\hat{\mathbf{g}}_t$ for the t-th frame. This estimation is naturally suitable for the moving speaker or moving microphone array case, as it tracks the variation of $\mathbf{g}$ by exponentially forgetting the older frames. It is also naturally suitable for the multiple-speaker case, as an estimate is obtained for each time-frequency bin. Based on the W-disjoint orthogonality assumption [29], each time-frequency bin is dominated by only one speaker, because of the natural sparsity of speech signals in the time-frequency domain. Therefore, $\hat{\mathbf{g}}_t$ of one time-frequency bin belongs to only one of the multiple speakers. In the presence of ambient noise, the accuracy of the estimated $\hat{\mathbf{g}}_t$ will degrade. To mitigate the effect of noise, the inter-frame spectral subtraction algorithm proposed in ref. [30] is integrated into the current framework.
Finally, adding back the frequency index f, the DP-RTF features are extracted from $\hat{\mathbf{g}}_{t,f}$, namely the entries corresponding to the first CTF coefficient of each microphone, and are denoted as $\mathbf{c}_{t,f}$; each of them is associated with a single speaker. The DOA estimation of multiple speakers is conducted by clustering $\mathbf{c}_{t,f}$ across all frequencies and frames, which is presented in the next subsection.
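A minimal sketch of the frame-wise RLS update described above is given below for a single frequency. The class name, the forgetting-factor value, and the regularised initialisation of the inverse correlation matrix are our assumptions; in the actual system the update is applied for every microphone pair k at every frame.

```python
import numpy as np

class RelativeCtfRLS:
    """Online RLS estimation of the relative CTF vector g (one frequency).

    Each observation is one microphone pair k at frame t, with regressor y
    and right-hand side d = -x_ref, so the model is y^T g = d (Equation (4)).
    """

    def __init__(self, dim, forgetting=0.98, delta=100.0):
        self.lam = forgetting
        self.g = np.zeros(dim, dtype=complex)           # current estimate of g
        self.P = delta * np.eye(dim, dtype=complex)     # inverse correlation matrix

    def update(self, y, d):
        u = np.conj(y)                                  # so that u^H g equals y^T g
        Pu = self.P @ u
        k = Pu / (self.lam + np.vdot(u, Pu))            # RLS gain vector
        e = d - np.vdot(u, self.g)                      # a priori matching error
        self.g = self.g + k * e
        self.P = (self.P - np.outer(k, np.conj(u)) @ self.P) / self.lam
        return self.g
```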
Feature clustering for DOA estimation
We adopt the complex Gaussian mixture model (CGMM) for feature clustering. First, the whole localisation space, for example, the 360° azimuth space on a horizontal plane, is divided into a group of discrete candidate directions, and let i = 1, …, I denote the index of candidate directions. For each candidate direction, the theoretical DP-RTF can be pre-computed using the ideal model of the signal's direct propagation path; it is denoted as $\mathbf{c}_i$. Each candidate direction is represented with one Gaussian component of the CGMM. The prior $w_i$ ($w_i \geq 0$ and $\sum_{i=1}^{I} w_i = 1$) of the Gaussian components denotes the a priori probability that one feature is emitted from the i-th candidate direction. The probability that one observed feature $\mathbf{c}_{t,f}$ is emitted by all candidate directions is the mixture probability:
$$p(\mathbf{c}_{t,f}) = \sum_{i=1}^{I} w_i\, \mathcal{N}_c\!\left(\mathbf{c}_{t,f};\, \mathbf{c}_i,\, \sigma^2\mathbf{I}\right),$$
where $\mathcal{N}_c$ denotes the complex Gaussian density with mean $\mathbf{c}_i$ and shared covariance $\sigma^2\mathbf{I}$.
For the moving speaker or moving microphone array case, $w_i$ is time-varying and can be estimated with a recursive expectation-maximisation (EM) algorithm. The time-varying estimate of $w_i$ is denoted as $\hat{w}_{i,t}$. After maximising the likelihood of the observed directional features $\{\mathbf{c}_{t,f}\}$, $\hat{w}_{i,t}$ represents the probability that the i-th direction is activated by speakers. DOA estimation can then be conducted by detecting the candidate directions with large $\hat{w}_{i,t}$, more specifically by detecting the peaks of $\hat{w}_{i,t}$ along the i axis.
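The sketch below illustrates one recursive EM step for the time-varying weights: the E-step computes, for every feature of the current frame, the posterior probability of each candidate direction, and the weights are then updated as an exponentially smoothed average of these posteriors. The smoothing factor, the shared variance, and the exact update form are illustrative assumptions; the recursive EM rule used in our system may differ in detail.

```python
import numpy as np

def recursive_em_step(features, steering, w_prev, alpha=0.05, sigma2=0.1):
    """One recursive EM update of the CGMM direction weights.

    features : (N, M) complex DP-RTF features c_{t,f} observed in the current frame
    steering : (I, M) theoretical DP-RTFs c_i of the I candidate directions
    w_prev   : (I,) previous weight estimates w_i
    """
    # squared distance of each feature to each candidate's theoretical DP-RTF
    d2 = np.sum(np.abs(features[:, None, :] - steering[None, :, :]) ** 2, axis=-1)
    log_lik = -d2 / sigma2                           # complex-Gaussian log-likelihood (up to a constant)
    log_post = np.log(w_prev + 1e-12) + log_lik      # E-step: unnormalised log posterior
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)          # (N, I) direction posteriors
    # M-step with exponential forgetting towards the frame-average posterior
    w_new = (1.0 - alpha) * w_prev + alpha * post.mean(axis=0)
    return w_new / w_new.sum()
```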
In this work, we use the microphone array embedded in an Azure Kinect to conduct DOA estimation. The topology of the microphone array is shown in Figure 3: it is composed of seven microphones arranged in a 2D plane. The microphone array is placed parallel to the horizontal plane and is thus suitable for horizontal (azimuth) localisation. A total of I = 72 candidate azimuth angles are set, with a 5° gap between adjacent angles, to cover the whole 360° azimuth space.
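The theoretical DP-RTFs of the candidate directions can be pre-computed from the array geometry with an ideal far-field, anechoic direct-path model, as in the sketch below. The microphone coordinates passed in are placeholders rather than the actual Azure Kinect layout, and the function name and reference-channel convention are assumptions.

```python
import numpy as np

def theoretical_dprtf(mic_xy, azimuths_deg, freq_hz, c=343.0, ref=0):
    """Ideal far-field DP-RTFs for planar candidate directions.

    mic_xy       : (M, 2) microphone coordinates in metres (placeholder geometry)
    azimuths_deg : (I,) candidate azimuths, e.g. np.arange(0, 360, 5) for I = 72
    freq_hz      : centre frequency of the STFT bin
    Returns an (I, M) complex matrix; row i is the direct-path transfer of each
    microphone relative to the reference microphone for direction i.
    """
    az = np.deg2rad(np.asarray(azimuths_deg))
    directions = np.stack([np.cos(az), np.sin(az)], axis=1)   # (I, 2) unit direction vectors
    delays = -(directions @ mic_xy.T) / c                     # (I, M) direct-path delays
    delays -= delays[:, [ref]]                                # delays relative to the reference
    return np.exp(-2j * np.pi * freq_hz * delays)

candidate_azimuths = np.arange(0, 360, 5)                     # the 5-degree grid, I = 72
```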
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Vision-audio data association
Our previous work [21] made some efforts to associate audio-signal information with the pixel matrix for a base-fixed robot, but this pre-calibrated vision-audio fusion mode does not work for a mobile robot: it is hard to label or re-calibrate the sound-source positions in the image plane while the robot is moving. Thus, in this work, we avoid this calibration phase and start from the sensor mechanics: the sound-source detection results are updated simultaneously to the camera imaging plane according to the camera extrinsic matrix.
The fusion consists of two steps. Firstly, for the RGB-D point clouds, two reference frames are defined: the camera focal centre $O_c$ and the microphone-array centre $O_m$, as shown in Figure 3. At time t, an image pixel can be warped from $O_c$ to $O_m$ through
$$\mathbf{x}_m = \pi\!\left(\mathbf{R}_{mc}\, D_C\, \pi^{-1}(\mathbf{x}_c) + \mathbf{t}_{mc}\right), \qquad (8)$$
where $\pi(\cdot)$ denotes the pinhole projection, $\pi^{-1}(\mathbf{x}_c)$ back-projects pixel $\mathbf{x}_c$ to a unit-depth ray, $D_C$ is the depth measured at $\mathbf{x}_c$, and $(\mathbf{R}_{mc}, \mathbf{t}_{mc})$ is the extrinsic transform from the camera frame to the microphone frame.
The second step is to mark the DOA in the RGB-D point clouds. Our DOA module outputs at 125 Hz, about three times the Kinect camera frame rate. We therefore interpolate the nearest DOA result to each RGB image using Equation (8) to associate the vision and audio information. Take Figure 3 as an example: the DOA module outputs the sound-source azimuth angle ϕ. The camera image is warped to the reference frame $O_m$ of the microphone array, and the azimuth ray is extended to intersect the image plane. The sound source then lies on a red plumb (vertical) line through this intersection point; that is, the detected azimuth angle of the sound source is projected onto the image plane as a vertical line at column $x_m$. Note that the image planes of $O_m$ and $O_c$ are different; according to Equation (9), the inverse warp
$$\mathbf{x}_c = \pi\!\left(\mathbf{R}_{mc}^{\top}\left(D_M\, \pi^{-1}(\mathbf{x}_m) - \mathbf{t}_{mc}\right)\right) \qquad (9)$$
requires the depth $D_M$ to transform $\mathbf{x}_m$ to $\mathbf{x}_c$, but this depth measurement is unknown. In practice, the moving obstacles are far away from the camera (at least 1 m), so this distance is much greater than the baseline between the camera centre and the microphone centre (7.4 cm). Therefore, we can adopt $D_C$ instead of $D_M$.
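The following sketch shows how a detected azimuth can be turned into the vertical source line in the camera image under the approximation above. The intrinsic matrix K, the axis convention (x right, z along the optical axis, azimuth measured in the horizontal plane), and the assumed source range are our assumptions; the substitution of the camera depth for the unknown $D_M$ follows the text.

```python
import numpy as np

def azimuth_to_image_column(phi_deg, K, R_cm, t_cm, range_guess=2.0):
    """Project a DOA azimuth onto the camera image as a vertical line (pixel column).

    phi_deg     : source azimuth in the microphone frame, in degrees
    K           : (3, 3) camera intrinsic matrix
    R_cm, t_cm  : extrinsics mapping microphone-frame points into the camera frame
    range_guess : assumed source range in metres (D_M is unknown; the camera depth
                  D_C is adopted instead, as described above)
    """
    phi = np.deg2rad(phi_deg)
    p_m = range_guess * np.array([np.sin(phi), 0.0, np.cos(phi)])  # point on the azimuth ray
    p_c = R_cm @ p_m + t_cm                      # express the point in the camera frame
    if p_c[2] <= 0.0:                            # the source is behind the image plane
        return None
    uvw = K @ p_c
    return int(round(uvw[0] / uvw[2]))           # column of the vertical source line
```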
Visual feature extraction and visual odometry
The DOA module outputs, for each azimuth, the probability (weight) that a noise emitter lies in that direction, and estimates the sound-source direction by detecting the peaks of these weights. However, in 3D vision we are more concerned with the location of the dynamic obstacle associated with this direction. This means that we need to delineate an image area that encompasses as much of the dynamic obstacle as possible, based on the direction of the sound source and on the magnitude of the detected sound-source weight.
As mentioned above, the DOAs of the sound sources are estimated by detecting the peaks of the GMM weights along the i axis. The peak directions are denoted as $i_{t,j} \in [1, I]$, $j = 1, \dots, J$, where J denotes the number of detected sound sources. The centres of the visual obstacles are set to the sound-source directions $i_{t,j}$. In addition, the visual boundaries of the obstacles should be detected so that the whole visual obstacle can be processed. For each moving obstacle, in order to estimate its left and right boundaries (i.e. the sound-source object area), for example the right boundary of visual obstacle $i_{t,j}$, we evaluate the GMM weights one by one from $i_{t,j} + 1$ towards the right candidate directions, and stop as soon as a stopping condition on the weights is satisfied; a sketch of this search is given below.
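A minimal sketch of the boundary search follows. The two stopping conditions used here, the weight falling below a fixed fraction of the peak weight or a maximum angular half-width being reached, are illustrative assumptions rather than the exact conditions used in our system.

```python
import numpy as np

def source_region(weights, peak_idx, frac=0.3, max_halfwidth=6):
    """Expand a sound-source region left and right from a peak of the GMM weights.

    weights       : (I,) direction weights w_i on the circular 5-degree grid
    peak_idx      : index of the detected peak direction i_{t,j}
    frac          : stop when the weight drops below frac * peak weight (assumption)
    max_halfwidth : stop after this many candidate directions per side (assumption)
    Returns (left_idx, right_idx) on the circular candidate grid.
    """
    I = len(weights)
    peak_w = weights[peak_idx]
    bounds = []
    for direction in (-1, +1):                       # search to the left, then to the right
        idx = peak_idx
        for step in range(1, max_halfwidth + 1):
            nxt = (peak_idx + direction * step) % I  # wrap around the 360-degree grid
            if weights[nxt] < frac * peak_w:
                break
            idx = nxt
        bounds.append(idx)
    return bounds[0], bounds[1]
```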
EXPERIMENTAL RESULTS
Audio signal processing
Figures 1 and 4 illustrate the DOA module experiments; they show the sound-source detection performance when the microphone-equipped robot is static. In our system, we set the detection resolution over the 360° azimuth to 5°, that is, I = 72, and the CTF length to P = 8. The sampling rate of the sound signal from our sensor is 16 kHz. The STFT has a window length of 256 samples and a hop size of 128 samples, and the algorithm outputs sound-source detections at 125 Hz. Within a range of 3 m, our DOA results are stable and accurate; beyond this range, the localisation error grows with distance. These experiments were run on a laptop with an Intel Core i7 CPU @ 2.30 GHz × 8 and 64 GB of system memory.
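The 125 Hz detection rate follows directly from these audio parameters: a new STFT frame, and hence a new DOA update, is produced every hop of 128 samples at the 16 kHz sampling rate,
$$f_{\text{DOA}} = \frac{f_s}{\text{hop size}} = \frac{16\,000}{128} = 125\ \text{Hz}.$$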
In terms of audio-signal processing accuracy, the final localisation and navigation results of the SLAM method are also heavily influenced by the confidence region assigned to the sound source. For example, Figure 4 shows a moving human whose body appears partly outside the confidence zone, since the body is farther away from the sound source (the phone). Larger obstacles obviously occupy larger areas in the image, so in our multi-agent experiments we set the confidence parameters of the sound-source zone according to the size of the robots in the scene. In Figure 5, image (a), the confidence zones for the Spark-T and AGV collaborative robots were set to 20 and 10, respectively, based on our experimental performance; the width of the green bounding box indicates the size of the confidence region. These pictures show that the extracted image features come only from the static background outside the sound-source confidence region, which underpins the robustness of the SLAM method. Image (b) shows that, without our method, many visual feature points are distributed over the surface of the moving robot, which leads to errors in the robot's self-localisation.
Visual SLAM
To evaluate the performance of the proposed SLAM method, comparison experiments were carried out with the original ORB2, StaticFusion, and ORB2 with the sound-source objects removed (ORB2+DOA). For the GPU-based SF method, an Nvidia 2080 Super was used. The environmental reconstruction performance is compared in multi-agent moving scenes (five Spark-T robots in the Sq sequences and two AGV robots in the Mo sequences). An adaptive control scheme was proposed in ref. [31] to significantly enhance locomotion robustness for biped robots in complex terrain; in our case, we mount the sensors on the AGV platform so as to keep the audio and vision signals smooth. We used a motion capture system to obtain the ground-truth robot trajectories.
Table 1 lists the ATE root-mean-square error (RMSE) of the different methods. In the experiments with the Spark robots, the ORB2 method produced an ATE of about 25 cm: examining the output camera trajectory, the visual odometry performed well at the beginning of the sequence (Sq1), but once the other mobile robots appeared in the field of view, an erroneous camera trajectory was produced. In contrast, our method achieves roughly a third of ORB2's error in this scene, and Figure 6 shows that our output camera trajectories are very close to the ground-truth curves. The SF method performs very poorly in these sequences, with an ATE of several metres, so we do not plot its trajectory. We believe this scene is unsuitable for SF because, when the camera image is repeatedly occluded by large moving obstacles, SF tends to classify the static background, which then occupies a relatively small fraction of the image pixels, as dynamic, while the moving obstacles are judged to be static background. Figure 7 gives a realistic picture of the environment reconstruction of these methods; the estimated camera trajectory is shown in blue. In these two scenes, one robot performs SLAM while the other acts as a dynamic obstacle. In contrast to ORB2 and SF, which lose localisation stability, our method builds an accurate map of the environment. In the Spark-T robot sequences, our method achieves an ATE down to about 8 cm, comparable to ORB2+DOA, and the mapping results are accurate.
TABLE 1 Visual odometry performance: ATE RMSE (metres).
Scene | ORB2 | StaticFusion | ORB2+DOA | Ours |
Spark-T robots | ||||
Sp1 | 0.24 | 3.4 | 0.079 | 0.1 |
Sp2 | 0.23 | 12.8 | 0.088 | 0.078 |
Sp3 | 0.35 | 26.47 | 0.22 | 0.14 |
Sp4 | 0.2 | 3.69 | 0.12 | 0.13 |
Sp5 | 0.25 | 2.35 | 0.2 | 0.19 |
AGV-cooperation robots | ||||
Mo1 | 1.32 | 14.2 | 0.13 | 0.12 |
Mo2 | 1.59 | 18.28 | 0.089 | 0.088 |
Mo3 | 0.94 | 0.71 | 0.038 | 0.04 |
Mo4 | 1.45 | 1.53 | 0.18 | 0.18 |
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Unlike the other classical SLAM systems, the ORB-SLAM framework has no dense mapping thread; it uses a sparse visual feature-point map for robot self-localisation, and the maps shown here are synthesised from the estimated camera trajectories. Figure 7 shows the environmental reconstruction results for two multi-agent SLAM scenes, five Spark robots in the Sq3 sequence and two AGV robots in the Mo3 sequence. The second column of images shows the results of the original ORB2: as can be seen, the other moving robots cause severe map distortions. The third column shows the SF approach, where unstable motion segmentation classifies some robots as static objects, which then appear in the map. The last column shows our method, which reconstructs the environment very accurately. We also compare the online efficiency of several methods in Table 2, using the classical ORB-SLAM2 as the baseline for visual SLAM in static environments. DynaSLAM and SF are GPU-supported dense reconstruction methods designed for dynamic environments. Although DynaSLAM's SLAM module is ORB2-based, it applies Mask R-CNN to perform detailed object segmentation on every image frame in the pre-processing stage, so its overall frame rate is only 0.3 fps. SF performs 3D point-cloud segmentation on surface elements and thus achieves 17 fps online motion segmentation, but this mode is not robust in the presence of multiple moving targets and is therefore not suitable for multi-agent navigation. In our case, the dynamic-object detector processes only the signal from the microphone array. In these experiments, the sound-signal sampling frequency was set to 16 kHz, and we obtained audio detection at 125 Hz and visual odometry at 14 fps.
TABLE 2 Computation time.
Approach | fps | GPU |
ORB2 [17] | 26 | × |
DynaSLAM [4] | 0.3 | ✓ |
SF [11] | 17 | ✓ |
Ours | 14 | × |
In addition to the multi-robot scenarios, we use the well-known public TUM RGB-D dataset [32] to compare with other RGB-D dense dynamic SLAM methods; see Table 3. We compare our method with three leading-edge dynamic SLAM methods, refs. [20, 33], and [34]. The table shows the ATE results of these four methods on the TUM fr3 static (s) and walking (w, dynamic) sequences. Dai et al. [20] use feature-point associations to handle dynamic pixels; this point-based idea is similar to ours, so their performance is similar to ours. Ref. [34] is a typical object-segmentation-based framework that carefully detects and removes the moving objects in front of the camera (human subjects in the TUM sequences); it therefore does better than the others in the four dynamic fr3/walking sequences, but worse in the normal static sequences. Ref. [33] is a typical visual-inertial fusion framework based on ref. [35] and achieves more accurate ATE with the help of inertial odometry.
TABLE 3 Comparison of ATE RMSE (m) in TUM dynamic sequences.
Sequence | Xie et al. [34] | Dai et al. [20] | Liu et al. [33] | Ours |
fr3/s_static | 0.007 | 0.01 | - | 0.006 |
fr3/s_xyz | 0.013 | 0.009 | - | 0.009 |
fr3/s_h | 0.019 | 0.024 | - | 0.02 |
fr3/s_rpy | 0.043 | 0.023 | - | 0.029 |
fr3/w_static | 0.01 | 0.011 | 0.008 | 0.022 |
fr3/w_xyz | 0.014 | 0.087 | 0.049 | 0.084 |
fr3/w_h | 0.028 | 0.035 | 0.061 | 0.152 |
fr3/w_rpy | 0.033 | 0.161 | 0.063 | 0.301 |
CONCLUSION AND FUTURE WORKS
In this paper, we proposed a novel vision-audio fusion method that integrates noise-source detection into visual SLAM. We adopt sound-source detection as a dynamic-object detector and combine it with a dense environment reconstruction pipeline to achieve dynamic SLAM in multi-agent/robot scenes. Experimental results show that the proposed method guarantees robust robot self-localisation and navigation even in heterogeneous multi-robot systems. Our DOA-based approach requires only ordinary on-board computing power, yet it outperforms GPU-based dynamic-environment approaches, which is very promising for multi-robot collaborative systems. Our next steps are the fine segmentation of the sound-source confidence regions and dynamic object recognition.
ACKNOWLEDGEMENTS
This work is supported by the Shenzhen Science and Technology Program (JSGG20220606142803007) and the Shenzhen Institute of Artificial Intelligence and Robotics for Society (AIRS).
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
Lee, T.-J., Kim, C.-H., Cho, D.-I.D.: A monocular vision sensor-based efficient SLAM method for indoor service robots. IEEE Trans. Ind. Electron. 66(1), 318–328 (2018). https://dx.doi.org/10.1109/tie.2018.2826471
Abstract
Moving humans, agents, and subjects bring many challenges to robot self-localisation and environment perception. To adapt to dynamic environments, SLAM researchers typically apply deep learning image segmentation models to eliminate these moving obstacles. However, such moving-obstacle segmentation methods consume too many computational resources for the onboard processing of mobile robots. In current industrial mobile-robot collaboration scenarios, the noise of mobile robots can easily be picked up by on-board audio-sensing processors, and the direction of sound sources can be effectively acquired by sound-source estimation algorithms, but estimating the distance of a sound source is difficult. Conversely, in the field of visual perception, the 3D structure of the scene is relatively easy to obtain, but the recognition and segmentation of moving objects is more difficult. To address these problems, a novel vision-audio fusion method that combines sound-source localisation with a visual SLAM scheme is proposed, thereby eliminating the effect of dynamic obstacles on multi-agent systems. Experiments with several heterogeneous robots in different dynamic scenes indicate very stable self-localisation and environment reconstruction performance of our method.