The semantic scene graph is an innovative approach to 3D scene representation. Its semantic object nodes are robust to viewpoint changes, effectively overcoming a key limitation of traditional visual localization methods. By organizing dense indoor environments into hierarchical data structures, semantic scene graphs enable efficient representation and processing. In multi-agent SLAM systems, this representation supports both coarse-to-fine localization and bandwidth-efficient communication.
This thesis addresses the problem of registering two rigid semantic scene graphs, an essential capability when an autonomous agent must register its map against that of a remote agent or against a prior map. Classical semantic-aided registration relies on hand-crafted descriptors, while learning-based scene graph registration depends on ground-truth semantic annotations; both limitations impede deployment in practical real-world environments. To address these challenges, we design a scene graph network that encodes multiple modalities of each semantic node: an open-set semantic feature, a spatially aware local topology, and a shape feature. These modalities are fused into compact semantic node features, and matching layers then search for correspondences in a coarse-to-fine manner. In the back end, a robust pose estimator computes the transformation from the correspondences. Our approach maintains a sparse, hierarchical scene representation, demanding fewer GPU resources and less communication bandwidth in multi-agent tasks. Furthermore, we propose a novel data generation pipeline that combines vision foundation models with a semantic mapping module to reconstruct semantic scene graphs. This pipeline eliminates the need for ground-truth semantic annotations, enabling fully self-supervised network training. Extensive evaluation on a two-agent SLAM benchmark demonstrates that our method (1) achieves significantly higher registration success rates than hand-crafted baselines, and (2) maintains superior registration recall relative to visual localization networks while requiring only 52 KB of communication bandwidth per query frame, an orders-of-magnitude improvement in transmission efficiency.
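The fuse-match-estimate pipeline described above can be sketched in simplified form. This is a minimal illustrative stand-in, not the thesis implementation: the learned fusion network is replaced by concatenation plus normalization, coarse matching by mutual nearest neighbors in feature space, and the robust pose estimator by a closed-form rigid alignment (Kabsch) on matched node centroids; all function names and dimensions are hypothetical.

```python
import numpy as np

def fuse_node_features(semantic, topology, shape):
    # Stand-in for the learned fusion: concatenate the per-node
    # modality features and L2-normalize into compact node features.
    f = np.concatenate([semantic, topology, shape], axis=1)
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def coarse_match(feat_a, feat_b):
    # Coarse correspondence search: mutual nearest neighbours
    # in cosine-similarity space between the two graphs' nodes.
    sim = feat_a @ feat_b.T
    nn_ab = sim.argmax(axis=1)
    nn_ba = sim.argmax(axis=0)
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]

def estimate_pose(pts_a, pts_b):
    # Closed-form rigid alignment (Kabsch) of matched node centroids:
    # returns R, t with pts_b ≈ R @ pts_a + t.
    ca, cb = pts_a.mean(axis=0), pts_b.mean(axis=0)
    H = (pts_a - ca).T @ (pts_b - cb)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cb - R @ ca
    return R, t
```

In practice, a robust estimator (e.g. RANSAC over the candidate correspondences) would wrap `estimate_pose` to reject outlier matches before accepting a transformation.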