Abstract
Semantic scene completion (SSC) is essential for autonomous driving and 3D scene understanding, closely mirroring the way humans perceive and interpret complex environments. A key element of human perception is temporal memory, which facilitates the rapid recognition and recall of previously observed elements. To emulate this capability in artificial intelligence systems, we extend VoxFormer, a model originally designed for spatial transformation, with a temporal memory component. The upgraded model, VoxFormer v2, incorporates tri-plane deformable temporal attention and a recurrent temporal fusion strategy. These additions significantly improve the model’s ability to process and understand short-term temporal dynamics in scene data. Evaluations on the SemanticKITTI and KITTI-360 datasets show that VoxFormer v2 establishes a new state of the art in SSC performance.





