Reinforcement learning (RL) has made significant strides in the last few years by proposing increasingly complex networks that use larger and larger amounts of data to solve a broad range of problems, from playing games to autonomous navigation. Continuing along this trajectory is infeasible for those who do not have access to the large amounts of computing power, data storage, or time required to perpetuate this trend. Additionally, these networks suffer from low sample efficiency and struggle to generalize to out-of-distribution data. This thesis proposes that leveraging the hierarchical structure inherent in many real-world problems, specifically navigation, while efficiently incorporating socially cognizant design into model training and ideation, can provide an alternative to this data- and compute-hungry approach.
We start with the hypothesis that using networks that mirror the hierarchical structure inherent in many tasks will allow for better overall task performance using simpler networks. We take inspiration from the temporal abstraction of human cognitive processes and compare the performance of several flat neural network architectures and hierarchical paradigms on the maze traversal task. The temporally abstracted actions of hierarchical networks, also called subroutines, span multiple time steps and allow agents to reason over complex skills and actions (like leaving a room or going around a corner) instead of low-level motor commands. We find that learning a policy over these temporally abstracted actions leads to faster training times, greater training stability, and higher accuracy than standard RL or supervised learning with LSTMs.
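The contrast between acting over subroutines and acting over motor commands can be sketched in a few lines. Everything here (the grid actions, the two subroutines, their step sequences) is an illustrative assumption, not the architecture studied in the thesis; the point is only that a high-level policy makes one decision per skill rather than one per time step.

```python
# Primitive actions on a grid: (row delta, column delta).
PRIMITIVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Hypothetical subroutines: multi-step skills built from primitives.
SUBROUTINES = {
    "exit_room_east": ["right", "right", "right"],
    "round_corner":   ["right", "down"],
}

def execute_subroutine(pos, name):
    """Unroll one subroutine over several time steps from position pos."""
    r, c = pos
    for action in SUBROUTINES[name]:
        dr, dc = PRIMITIVES[action]
        r, c = r + dr, c + dc
    return (r, c)

# A policy over subroutines chooses twice here; a flat policy would
# have to choose five times, once per primitive motor command.
pos = (0, 0)
for choice in ["exit_room_east", "round_corner"]:
    pos = execute_subroutine(pos, choice)
```

The shorter decision horizon is what drives the faster, more stable training observed in the experiments.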
Using this insight, we next explore whether a predefined set of subroutines used by hierarchical networks provides better performance than a learned set. We create a hierarchical framework, composed of a manager network that passes information to a worker network via a goal vector, for autonomous vehicle steering angle prediction from egocentric videos. The manager network learns an embedding space of subroutines from historical vehicle information. This learned subroutine embedding from the manager allows the worker network to predict the next steering angle more accurately than when using predefined subroutines. Additionally, this hierarchical framework shows improvements over state-of-the-art steering angle prediction methods.
In the real world, it is uncommon for the full set of subroutines needed to accomplish a task to be known a priori. Additionally, a complete autonomous navigation agent must have a model of pedestrian behavior. In the next set of experiments, we address these concerns by building a network that learns a dictionary of pedestrian social behaviors in a self-supervised manner. We use this dictionary to analyze the relationship between pedestrian behavior and the spaces pedestrians inhabit, as well as the relationships between the subroutines themselves. We also use this behavior embedding network in a hierarchical framework to constrain the state space for a worker network, allowing future pedestrian trajectories to be predicted with a very simple architecture.
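Once a behavior dictionary exists, using it amounts to labeling each observed trajectory with its nearest dictionary entry, and that discrete label is what constrains the worker's state space. The entries, feature dimensions, and centroids below are made up for illustration; in the thesis the dictionary is learned self-supervised rather than hand-specified.

```python
# Hypothetical behavior dictionary: one centroid per behavior,
# in an assumed 2-d trajectory-feature space.
DICTIONARY = {
    "walking":  (1.0, 0.0),
    "standing": (0.0, 0.0),
    "turning":  (0.5, 1.0),
}

def assign_behavior(feature):
    """Label a trajectory feature with its nearest dictionary entry."""
    def dist2(centroid):
        return (feature[0] - centroid[0]) ** 2 + (feature[1] - centroid[1]) ** 2
    return min(DICTIONARY, key=lambda name: dist2(DICTIONARY[name]))

# Discrete behavior labels for three observed trajectory features.
labels = [assign_behavior(f) for f in [(0.9, 0.1), (0.1, -0.1), (0.6, 0.9)]]
```

Conditioning a trajectory predictor on a handful of such labels, instead of the full continuous feature space, is what lets the downstream worker network stay simple.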
Finally, we combine our findings into a hierarchical, socially cognizant, visual navigation agent. Instead of formalizing navigation in a traditional reinforcement learning framework, we implicitly learn to mimic optimal human navigation policies from collected demonstrations for the image-goal task in a simulated environment. We build a hierarchical framework with three levels. The first network builds a latent space that acts as a memory module for the navigation agent. The second network predicts waypoints in the current observation space, indicating which area of the environment to move towards. The third network predicts which action to execute in the environment using a simple classifier. The key to this method's success is that each of these networks operates at a different temporal or spatial scale, allowing them to bootstrap off of each other to incrementally solve a much larger navigational task and achieve state-of-the-art results without the use of RL, graphs, odometry, metric maps, or other computationally complex and memory-intensive methods.
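The three-level loop can be sketched schematically. The update rule, the waypoint interval K, the scalar observations, and the two-way action classifier are all assumptions made for illustration; the sketch only captures the key structural idea that the three networks fire at different temporal scales within one control loop.

```python
K = 4  # hypothetical waypoint interval: the mid-level fires every K steps

def update_memory(latent, observation):
    """Level 1: fold the new observation into a running latent memory."""
    return 0.9 * latent + 0.1 * observation

def predict_waypoint(latent):
    """Level 2: pick a coarse direction to pursue for the next K steps."""
    return 1 if latent > 0 else -1

def classify_action(waypoint, observation):
    """Level 3: choose a low-level action toward the current waypoint."""
    return "forward" if waypoint * observation >= 0 else "turn"

latent, waypoint = 0.0, 1
actions = []
for t, obs in enumerate([0.5, 0.2, -0.1, 0.4, 0.3, -0.2, 0.1, 0.6]):
    latent = update_memory(latent, obs)             # every step
    if t % K == 0:
        waypoint = predict_waypoint(latent)         # every K steps
    actions.append(classify_action(waypoint, obs))  # every step
```

Because the memory and waypoint levels summarize longer horizons, the per-step classifier never has to reason about the global task, which is what keeps each individual network simple.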
