
Abstract

The research in this dissertation is motivated by the challenge of building computational systems that can perceive, understand, and interact with the world from a first-person, or egocentric, perspective. The central premise of this work is that the egocentric viewpoint is fundamental to creating technologies, from augmented reality assistants to mobile robots, that can effectively and safely collaborate with people in their daily environments.

A principal impediment to progress in this domain is the scarcity of large-scale, diverse datasets that provide the supervision needed to train robust models. Specifically, there is a need for data that captures rich, first-person sensory inputs together with the corresponding three-dimensional world states and actions. The difficulty of acquiring and annotating such data at scale gives rise to the primary technical challenges this thesis aims to address.

To overcome this data scarcity, the research presented here is structured around the paradigm of a "perception-simulation loop." This framework treats perception and simulation as symbiotic, complementary processes, each of which can be used to improve the other. The contributions of this dissertation are therefore organized into two main parts, each investigating a different arc of this loop.

The first part of the thesis focuses on perception, investigating methods that learn directly from egocentric visual data. The work on COPILOT explores the use of large-scale synthetic data for near-term collision prediction, while the work on LookOut leverages targeted real-world data collection to model longer-term navigational intent in dynamic environments. The second part shifts to simulation and modeling, exploring the use of strong priors to generate plausible human-world interactions. Here, MultiPhys demonstrates a physics-based approach to refining multi-person motion estimates, while ActAnywhere takes a data-driven approach, using a generative model trained on large-scale video to synthesize semantically coherent scenes.

Collectively, these projects demonstrate a multi-faceted strategy for mitigating the data scarcity problem in egocentric perception and simulation. The thesis concludes with a summary of these contributions and a discussion of promising future research directions. I hope that the methods and insights presented here will contribute to the development of more powerful, robust, safe, and intuitive interactive systems.

Details

Title
Perceiving and Simulating Human-World Interactions for Egocentric Agents
Number of pages
121
Publication year
2025
Degree date
2025
School code
0212
Source
DAI-B 87/5(E), Dissertation Abstracts International
ISBN
9798265427540
Committee member
Fatahalian, Kayvon; Liu, Karen
University/institution
Stanford University
University location
United States -- California
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32316518
ProQuest document ID
3275492734
Document URL
https://www.proquest.com/dissertations-theses/perceiving-simulating-human-world-interactions/docview/3275492734/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic