The ability to understand, generate, and ultimately act within our 3D world, an ability referred to as spatial intelligence, is a fundamental aspect of human cognition and a central goal of artificial intelligence. However, developing spatial intelligence faces fundamental challenges in scaling compute and data, the two factors that drive the progress of modern AI. This thesis aims to advance spatial intelligence by developing a suite of compute- and data-efficient algorithms that address these challenges.
From the compute perspective, this thesis introduces a set of highly efficient 3D and 4D algorithms that achieve over 1000x speedups and 10,000-100,000x memory savings compared to existing methods, while maintaining comparable or better performance. These algorithms, widely adopted in subsequent research, significantly improve the efficiency and scalability of 3D and 4D pipelines, and can be seamlessly integrated into standard deep learning frameworks to better utilize compute resources.
From the data perspective, this thesis investigates how to leverage 2D foundation models to overcome the scarcity of 3D data and supervision. This leads to a series of data-efficient algorithms for both 3D generation and 3D understanding, the two primary pillars of spatial intelligence. For the first time, we demonstrate that large-scale 3D scenes can be generated purely from 2D generative priors, and that 3D vision-language grounding can be advanced by distilling knowledge from 2D vision-language models without requiring direct 3D supervision. These methods highlight the potential of 2D foundation models to enhance spatial intelligence through data efficiency.
Finally, this thesis explores model self-improvement as an alternative approach to mitigating data scarcity in both 2D and 3D domains. We show that 2D vision-language models can iteratively improve themselves by generating, refining, and learning from their own data through self-inspection, supported by image editing tools. Although this exploration begins in 2D, it opens a promising direction toward extending self-improving models to 3D and 4D domains, bringing us closer to scalable, generalizable spatial intelligence.