INTRODUCTION
We live at a time that could be called “AI everywhere, all at once.” As we see increasing deployments of AI systems across a variety of fields, including medicine, manufacturing, education, and transportation, we must confront what is often termed the alignment problem (Brown et al. 2021; Christian 2021), the problem of getting AI systems to behave in the ways that we as humans want them to behave.
In traditional AI and robotics, the reward function is known, and there exist many tools that allow an agent to optimize its behavior to maximize its reward. However, in many real-world systems such as self-driving cars and domestic service robots, the reward function is hard to specify because we need to make nuanced trade-offs between competing and often implicit factors and objectives. Inevitably, the agent will encounter some new situation where optimizing its hand-specified reward function ends up leading to poor behaviors, despite that same reward function having led to good performance in previous situations (Amodei et al. 2016; Tien et al. 2023). This problem is especially pronounced when the task itself is fundamentally about human preferences. Not only do we want a personal service robot to be able to unload the dishwasher and put away the groceries, but we also want it to arrange objects in a way that matches our preferences. Safe and robust performance in these settings fundamentally requires human input, because the task itself is to do what the human wants. Furthermore, many AI and robot tasks require anticipating human behavior (e.g., autonomous cars, service robots, collaborative manipulators in manufacturing, etc.) making it impossible to perform these tasks well without human data.
Long-term vision of human-centered AI
My work is inspired by the vision of a world where AI systems assist and empower humans—a world where AI-enhanced exoskeletons and physical prosthetics empower users with disabilities, where robot surgeons and medics collaborate seamlessly with humans in providing healthcare, where AI systems are robust to out-of-distribution inputs and are transparent and interpretable, and where AI assistants act as cognitive prosthetics to help boost creativity and empower users to live more fulfilling lives. As AI systems become ubiquitous and increasingly competent, it is critical that we ensure that they are beneficial to society. I am a strong proponent of a human-centered approach to AI where, rather than trying to hand-specify a fixed objective for an AI system, we instead give AI systems the objective of optimizing our objectives (Russell 2022). This human-centered approach requires us to consider AI systems, not in isolation, but as assistive, interactive partners with humans. Much work on robust machine learning and control seeks to be resilient to, or completely remove the need for, human input. By contrast, I would argue that human input is an extremely valuable and even critical component of robustness. My research seeks to take steps toward the ultimate goal of building robust, aligned, and human-centered AI systems that achieve safe, reliable, and beneficial outcomes for society.
Research challenges
Much of my work focuses on the problem of robust value alignment: ensuring that robots and other AI systems do what we, as humans, actually want them to do. My work bridges both theory and practice, with an emphasis on theory that is constructive and leads to significant and practical algorithmic improvements. In particular, my research aims to allow AI systems to learn behaviors that are robust to missing or mis-specified reward functions. Designing good reward functions is highly challenging and error-prone, even for domain experts. Consider trying to write down a reward function that describes good driving behavior, how you like your bed made in the morning, or what makes a conversation engaging. While reward functions for these tasks are extremely difficult to write down, human feedback in the form of demonstrations or preferences is often much easier to obtain. However, human data can often be extremely difficult to interpret, due to ambiguity and noise. Thus, it is critical that AI systems are robust to epistemic uncertainty over the human's true intent. Within the areas of reward learning and human intent inference, we have been making fundamental advances in the following challenging research areas: (1) efficiently quantifying uncertainty over human intent, (2) directly optimizing behavior to be robust to uncertainty over human intent, and (3) actively querying for additional human input to reduce uncertainty over human intent.
QUANTIFYING UNCERTAINTY
Maintaining estimates of uncertainty is a critical first step toward robust AI. We want AI systems that can deal with ambiguity and noise in human feedback and that know what they know and what they do not know (see Figure 1).
[Figure 1 omitted. See original PDF.]
Bayesian reward learning
Much prior research on reward learning either completely ignores uncertainty or uses computationally intractable algorithms which require solving hundreds or thousands of reinforcement learning problems to generate a posterior distribution over likely human reward functions (Ramachandran and Amir 2007). To address this problem, we developed the first scalable Bayesian reward inference algorithm for visual imitation learning domains (Figure 2). This research builds on my earlier work on offline reinforcement learning from human feedback (Brown et al. 2019a; Brown, Goo, and Niekum 2019b), where we developed an approach for learning reward functions from small numbers of ranked, suboptimal demonstrations, resulting in the first scalable reward learning approach to achieve better-than-demonstrator performance. Our key insight for scalable Bayesian reward inference was to combine preference-based reward learning with self-supervised deep learning techniques to learn a lower-dimensional latent state representation where Bayesian reward inference becomes tractable. In comparison to prior Bayesian reward inference approaches, which would take days to run, my research enables Bayesian reward inference in only a few minutes by leveraging a small number of human pairwise preferences over trajectories (Brown et al. 2020a).
[Figure 2 omitted. See original PDF.]
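To make the core computation concrete, here is a minimal, self-contained sketch of Bayesian reward inference from pairwise preferences: a Bradley-Terry likelihood over linear reward weights, sampled with Metropolis-Hastings. The feature dimension, trajectory features, and all hyperparameters are invented for illustration and stand in for the learned latent representation described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented setup: each trajectory is summarized by a 3-dim feature vector
# (standing in for the learned latent representation described above).
true_w = np.array([1.0, -0.5, 0.2])      # the human's (unknown) reward weights
feats = rng.normal(size=(30, 3))         # feature sums of 30 trajectories

# Simulate 100 noisy pairwise preferences from a Bradley-Terry model.
pairs = [(i, j) for i in range(30) for j in range(i + 1, 30)][:100]
prefs = []
for i, j in pairs:
    p_i = 1.0 / (1.0 + np.exp((feats[j] - feats[i]) @ true_w))
    prefs.append((i, j) if rng.random() < p_i else (j, i))

# Each row of D is (loser features - winner features), so the Bradley-Terry
# log-likelihood of all preferences is -sum(log(1 + exp(D @ w))).
D = np.array([feats[loser] - feats[winner] for winner, loser in prefs])

def log_posterior(w):
    return -np.log1p(np.exp(D @ w)).sum() - 0.5 * w @ w  # N(0, I) prior

# Metropolis-Hastings over reward weights.
samples, w = [], np.zeros(3)
lp = log_posterior(w)
for step in range(2000):
    w_prop = w + 0.1 * rng.normal(size=3)
    lp_prop = log_posterior(w_prop)
    if np.log(rng.random()) < lp_prop - lp:   # accept/reject step
        w, lp = w_prop, lp_prop
    if step >= 500:                           # discard burn-in
        samples.append(w.copy())

posterior = np.array(samples)
print("posterior mean:", posterior.mean(axis=0))
print("posterior std: ", posterior.std(axis=0))
```

In our actual work, the features come from a self-supervised encoder and the preferences from humans, but the likelihood and sampler have roughly this shape; the key point is that inference happens in a low-dimensional space where each likelihood evaluation is cheap.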
Bayesian constraint learning
Learning rewards from human feedback is not the only scenario where we want AI systems to maintain uncertainty. In other cases, such as crowd navigation, it may be relatively easy to specify a goal for an AI system (e.g., as a specific landmark or waypoint), but there may still be implicit constraints on the robot's path to this goal that are difficult to manually specify. Learning constraints from demonstrations provides a natural and efficient way to improve the safety of AI systems; however, prior work only considered learning a single, point estimate of the constraints. We proposed a novel Bayesian inverse constraint learning approach that infers a posterior probability distribution over constraints from demonstrated trajectories (Papadimitriou, Anwar, and Brown 2022). Our approach improves on prior constraint inference algorithms by enabling the freedom to infer constraints from partial trajectories and even from disjoint state-action pairs, the ability to infer constraints from suboptimal demonstrations and in stochastic environments, and the opportunity to use the posterior distribution over constraints in order to implement active learning and robust policy optimization techniques. We also recently proposed a novel and highly scalable Bayesian method that infers constraints based on pairwise preferences over trajectories (Papadimitriou and Brown 2024) that can adapt to cases where there are varying levels of constraint violation.
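The flavor of the inference problem can be sketched in a toy gridworld (entirely invented here, and assuming a Boltzmann-rational demonstrator): a demonstrated detour around the short route is strong evidence that some cell on that route is constrained.

```python
from collections import deque
import math

# Toy 3x4 gridworld: '#' are known walls, S is the start, G the goal.
# One free cell may carry an unobserved constraint; we infer a posterior
# over which cell (if any) it is. All of this is an invented illustration.
grid = ["S...",
        ".#.#",
        "G..."]
ROWS, COLS = 3, 4
walls = {(r, c) for r in range(ROWS) for c in range(COLS) if grid[r][c] == "#"}
start, goal = (0, 0), (2, 0)

def shortest_len(blocked):
    """BFS shortest path length from start to goal, avoiding walls+blocked."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (r, c), d = frontier.popleft()
        if (r, c) == goal:
            return d
        for nr, nc in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if (0 <= nr < ROWS and 0 <= nc < COLS
                    and (nr, nc) not in walls and (nr, nc) not in blocked
                    and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), d + 1))
    return math.inf

# Observed demonstration: the human detours through the right doorway
# instead of taking the 2-step route through the left doorway at (1, 0).
demo = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]

# Hypotheses: any free non-start/goal cell is constrained, or none is.
free = [(r, c) for r in range(ROWS) for c in range(COLS)
        if grid[r][c] != "#" and (r, c) not in (start, goal)]
hypotheses = free + [None]

beta = 1.0  # assumed demonstrator rationality
weights = {}
for h in hypotheses:
    if h is not None and h in demo:
        weights[h] = 0.0          # the demo violates this hypothesis
    else:
        gap = (len(demo) - 1) - shortest_len({h} if h else set())
        weights[h] = math.exp(-beta * gap)   # Boltzmann-rational likelihood

total = sum(weights.values())
posterior = {h: v / total for h, v in weights.items()}
best = max(posterior, key=posterior.get)
print("most likely constrained cell:", best, f"(p={posterior[best]:.3f})")
```

The posterior concentrates on the left doorway, because only under that hypothesis is the demonstrated detour optimal; hypotheses the demonstration passes through get zero likelihood, matching the intuition in the papers above.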
Confidence-based autonomy and self-assessment
Enabling robots and other AI systems to quantify and accurately represent uncertainty unlocks many different capabilities. Here we briefly describe two examples. First, if we want robots that can be robustly deployed in a variety of settings, we would like these robots to be able to provide high-confidence guarantees on their performance. By leveraging Bayesian representations of reward function uncertainty, we developed a novel framework (Brown and Niekum 2017, 2018) for computing high-confidence probabilistic safety bounds that are four orders of magnitude more sample efficient than the prior state-of-the-art bounds (Abbeel and Ng 2004; Syed and Schapire 2007). This enables a critical component of robustness: the ability to provide practical high-confidence bounds on generalization performance when learning from small numbers of human demonstrations. As a novel application of these high-confidence performance bounds, my research was the first to demonstrate a scalable and practical method to automatically detect reward hacking or gaming behaviors, cases where the robot's learned reward function results in a behavior that violates the human's true preferences (Brown et al. 2020a).
Second, by leveraging a Bayesian reward learning approach, combined with high-confidence bounds on performance, we recently demonstrated, for the first time, that a robot can know whether it has enough demonstrations to reach a certain threshold of performance (Trinh, Chen, and Brown 2024). Most existing methods for learning from demonstrations simply provide as many demonstrations as possible and hope that performance will be good. Our work was the first to formalize the idea of demonstration sufficiency, enabling robots to know with high confidence when they can stop requesting demonstrations.
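A minimal sketch of how a reward posterior yields a high-confidence performance bound and a stopping rule for demonstration sufficiency; the posterior samples, candidate policies, feature expectations, and tolerance below are all fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fabricated inputs: samples from a reward posterior (e.g., from Bayesian
# IRL) and feature expectations of a small set of candidate policies.
posterior_w = rng.normal(loc=[1.0, -0.5, 0.2], scale=0.15, size=(2000, 3))
policies = {
    "eval":  np.array([0.9, -0.4, 0.1]),   # policy learned from demonstrations
    "alt_a": np.array([1.0,  0.5, 0.0]),
    "alt_b": np.array([0.2, -0.9, 0.9]),
}
fe = np.stack(list(policies.values()))       # (n_policies, n_features)

values = fe @ posterior_w.T                  # value of each policy per sample
regret = values.max(axis=0) - values[0]      # eval-policy regret per sample

# 95%-confidence bound: with probability >= 0.95 under the posterior, the
# eval policy loses at most this much relative to the best candidate.
bound = np.quantile(regret, 0.95)
print(f"95%-confidence regret bound: {bound:.3f}")

# Demonstration sufficiency: stop requesting demonstrations once the
# bound falls below a designer-chosen tolerance eps.
eps = 0.5
print("sufficient demonstrations:", bound < eps)
```

As more demonstrations arrive, the posterior narrows, the regret quantile shrinks, and the stopping condition eventually triggers; this is the basic logic behind the sufficiency test, though the real method bounds regret in a more careful, normalized way.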
ROBUSTNESS TO UNCERTAINTY
When learning from human inputs, an important challenge is how to deal with epistemic uncertainty. As discussed in the previous section, much of our work has focused on quantifying and maintaining uncertainty when learning from human feedback. But what should an AI system do with its uncertainty estimates? This section discusses how AI systems can optimize their behavior in the face of uncertainty and our work on measuring robustness and detecting misalignment.
Optimizing policies under uncertain reward functions
The field of reinforcement learning is grounded in the idea that there is a single, scalar reward signal that should be optimized. However, when taking the human-centered AI approach that I described earlier, the AI system's reward function is initially unknown. As the AI system interacts with the human, it will gain more information about the human's values and preferences, but it will still have uncertainty over what the human's true reward function is. To perform policy optimization in this setting, we proposed a risk-aware approach that can optimize a policy across a distribution of reward functions (Brown, Niekum, and Petrik 2020b). Our research enables AI systems to hedge against uncertainty, rather than seeking to optimize a single best guess of the human's reward function. On challenging high-dimensional control benchmarks, this approach optimizes policies that are significantly more robust than prior state-of-the-art approaches, which struggle to effectively deal with small numbers of ambiguous human preferences (Javed et al. 2021). By optimizing for multiple objectives simultaneously, our work allows AI systems to be robust to spurious correlations and also provides a practical solution for the increasingly important problem of multi-agent value alignment—optimizing decision policies that reflect and address the values and preferences of multiple groups or individuals.
Measuring robustness of policies and interactions
Along with robust policy optimization approaches, we also recently developed methods for robustness assessment of policies learned from human demonstrations and policies optimized for human-robot collaboration.
A popular trend in modern robotics is to leverage high-capacity neural network models for behavioral cloning: a supervised learning approach that trains policies by imitating expert demonstrations. Learning-from-demonstration algorithms have shown promising results in robotic manipulation tasks, but their vulnerability to adversarial attacks remains underexplored. We recently performed a comprehensive study of adversarial attacks on both classic and recently proposed implicit (Florence et al. 2021), denoising diffusion (Chi et al. 2023), and transformer-based (Lee et al. 2024) behavioral cloning methods. Our experiments (Patil et al. 2025) reveal that most current methods are highly vulnerable to pixel-based adversarial perturbations. We also show that these attacks are often transferable across algorithms, architectures, and tasks, raising concerns about security vulnerabilities to black-box attacks. Our findings highlight the vulnerabilities of modern algorithms that learn from demonstrations, paving the way for future work on addressing these limitations and making such systems more robust.
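To illustrate the attack mechanics, here is a targeted FGSM-style perturbation against a stand-in linear policy, for which the input gradient is available in closed form. Real behavioral cloning policies are deep networks, but the attack has the same shape with autograd supplying the gradient; everything below is a toy illustration, not our experimental setup.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in "behavior-cloned" policy: a linear map from a flattened 8x8
# observation to a 2-D action.
W = rng.normal(0, 0.1, size=(64, 2))
policy = lambda x: x @ W

x = rng.uniform(0, 1, size=64)                 # clean observation ("pixels")
a_clean = policy(x)
a_target = a_clean + np.array([1.0, -1.0])     # adversary's desired action

# Targeted FGSM: step each pixel by eps against the sign of the gradient of
# the attack loss L(x) = ||policy(x) - a_target||^2, which for this linear
# policy is 2 * W @ (policy(x) - a_target).
eps = 8 / 255                                  # small L_inf pixel budget
grad = 2 * W @ (policy(x) - a_target)          # dL/dx in closed form
x_adv = np.clip(x - eps * np.sign(grad), 0.0, 1.0)

print("action error before attack:", np.linalg.norm(a_clean - a_target))
print("action error after attack: ", np.linalg.norm(policy(x_adv) - a_target))
print("pixel budget used (L_inf): ", np.abs(x_adv - x).max())
```

Even with a per-pixel budget of 8/255, the action moves measurably toward the adversary's target because the perturbation accumulates across all input dimensions; high-dimensional visual policies suffer the same effect, which is why iterative pixel attacks succeed against them.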
We have also extended this kind of vulnerability analysis to robot policies designed for collaborative assistance. The safety of collaborative human-robot interaction tasks can be hard to verify because people can behave unexpectedly at test time, potentially interacting with the robot outside its training distribution and leading to failures. Even just measuring robustness is a challenge. Adversarial perturbations are the default, but they can paint the wrong picture: they can correspond to human motions that are unlikely to occur during natural interactions with people. A robot policy might fail under small adversarial perturbations but work under large natural perturbations. To capture robustness in these interactive settings, we proposed to analyze the Pareto frontier of adversarial human policies that trade off between naturalness and low robot performance (He et al. 2023). Our work provides a meaningful measure of robustness that is predictive of deployment performance and uncovers failure cases in human-robot interaction that are difficult to find through manual stress testing.
Detecting misalignment
Another facet of robustness we are interested in is detecting reward misalignment. We specifically studied Reinforcement Learning from Human Feedback (RLHF), which has emerged as a popular paradigm for capturing human intent by learning a reward function from pairwise preferences over an AI system's output. We started by performing a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states—resulting in poor policy performance when optimized (Table 1). We found that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability all significantly exacerbate reward misidentification (Tien et al. 2023). These findings illuminate the susceptibility of preference learning to reward misidentification and causal confusion and highlight the need for ongoing work to address these challenges.
TABLE 1 Empirical evidence of causal confusion and reward misidentification from our recent work (Tien et al. 2023).
| Domain | Pref. accuracy (train) | Pref. accuracy (test) | Learned reward (pref/gt) | True reward (pref/gt) | Success rate (pref/gt) |
|---|---|---|---|---|---|
| Reacher | 0.96 | 0.96 | −12.0 / −14.9 | −11.9 / −5.6 | 0.1 / 0.8 |
| Feeding | 0.98 | 0.96 | 106.5 / 68.8 | −45.4 / 128.9 | 0.4 / 1.0 |
| Itching | 0.97 | 0.92 | 17.9 / 12.7 | −68.0 / 248.4 | 0.0 / 1.0 |
One of the challenges of determining whether a reward function is misaligned is that most reward learning approaches use black-box reward functions that, while expressive, are difficult to interpret and often require running reinforcement learning before we can even determine whether the learned reward is actually aligned with human preferences. We recently made progress toward detecting certain types of misalignment with a novel approach for learning both expressive and interpretable reward functions from preferences using differentiable decision trees (Kalra and Brown 2024). Our experiments provide evidence that the tree structure of our learned reward function is useful as a kind of alignment debugging tool for determining the extent to which the reward function is aligned with human preferences (Figure 3). In particular, we provide evidence that learning rewards as differentiable decision trees can reveal cases of silent misalignment: cases where the learned reward function is based on the wrong (e.g., non-causal) features but leads to behavior that appears aligned based on in-distribution RL performance yet fails out of distribution. Importantly, our work is able to reveal silent misalignment without needing to run costly reinforcement learning evaluations.
[Figure 3 omitted. See original PDF.]
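The representational idea can be sketched as a soft decision tree whose internal nodes make differentiable sigmoid splits on named features; because each split is a readable linear test, the learned reward can be audited directly. The features, split weights, and leaf values below are invented for illustration (a trained tree would learn them from preferences):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative depth-2 soft decision tree reward over 3 named features.
# Internal nodes make soft (sigmoid) splits, so the whole tree is
# differentiable and trainable end-to-end; leaves hold scalar rewards.
features = ["dist_to_goal", "near_human", "speed"]
splits = {  # node -> (weights, bias): route left with prob sigmoid(w @ x + b)
    "root":  (np.array([0.0, 5.0, 0.0]), -2.5),   # roughly: "near a human?"
    "left":  (np.array([0.0, 0.0, 4.0]), -2.0),   # "...and moving fast?"
    "right": (np.array([-3.0, 0.0, 0.0]), 1.5),   # "far from the goal?"
}
leaves = {"ll": -5.0, "lr": 0.5, "rl": -1.0, "rr": 2.0}

def reward(x):
    p_root = sigmoid(splits["root"][0] @ x + splits["root"][1])
    p_l = sigmoid(splits["left"][0] @ x + splits["left"][1])
    p_r = sigmoid(splits["right"][0] @ x + splits["right"][1])
    return (p_root * (p_l * leaves["ll"] + (1 - p_l) * leaves["lr"])
            + (1 - p_root) * (p_r * leaves["rl"] + (1 - p_r) * leaves["rr"]))

# Because every split is a readable linear test on named features, we can
# audit what the reward attends to by probing a few states:
fast_near_human = np.array([0.5, 1.0, 1.0])
slow_far_away   = np.array([1.0, 0.0, 0.2])
print(reward(fast_near_human), reward(slow_far_away))
```

Reading off the splits is what enables alignment debugging: if a learned split tests a non-causal feature (say, a distractor pixel statistic instead of `near_human`), the misalignment is visible in the tree itself, before any reinforcement learning is run.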
ACTIVELY REDUCING UNCERTAINTY
What if an AI system has so much uncertainty over the human's true reward function that it cannot guarantee good performance with sufficiently high probability? How do we fix a misaligned or non-robust reward function or policy? One natural way to address these challenges is via active learning: generating targeted queries for additional human input in states where the AI system has high uncertainty.
Active reward learning
Learning a reward function from human preferences is challenging, as it typically requires having a high-fidelity simulator or using expensive and potentially unsafe physical rollouts in the environment. However, in many tasks the agent might have access to offline data from related tasks in the same target environment. While offline data is increasingly being used to aid policy optimization via offline reinforcement learning (Levine et al. 2020), we recently observed that it can also be a surprisingly rich source of information for preference learning. We proposed an approach that uses an offline dataset to craft preference queries via pool-based active learning, learns a distribution over reward functions, and optimizes a corresponding policy via offline reinforcement learning (Shin, Dragan, and Brown 2023). Crucially, our proposed approach requires neither physical rollouts nor an accurate simulator for either the reward learning or policy optimization steps. Excitingly, our results demonstrated that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data.
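One standard pool-based acquisition rule, sketched below with invented numbers, is to query the pair of pooled trajectories whose predicted preference splits the reward posterior most evenly; the exact acquisition functions in our papers may differ.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

# Fabricated inputs: posterior samples over reward weights (e.g., from
# preference-based Bayesian reward learning) and a pool of offline
# trajectories summarized by 2-D feature vectors.
posterior_w = rng.normal([1.0, -0.5], 0.6, size=(500, 2))
pool = rng.normal(size=(12, 2))          # feature sums of pooled trajectories

def split(i, j):
    """Fraction of posterior samples that prefer trajectory i over j.
    A value near 0.5 means the posterior is maximally undecided."""
    return float((((pool[i] - pool[j]) @ posterior_w.T) > 0).mean())

# Query the pair whose predicted preference divides the posterior most
# evenly; the human's answer then rules out roughly half the posterior mass.
best = min(combinations(range(len(pool)), 2),
           key=lambda ij: abs(split(*ij) - 0.5))
print("query pair:", best, "posterior split:", split(*best))
```

After the human answers, the posterior is reweighted by the Bradley-Terry likelihood of that answer and the next query is selected from the pool; no new environment rollouts are ever needed, which is what makes the approach fully offline.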
We also recently studied the benefits and challenges of using a learned dynamics model (Figure 4) when performing active reward learning (Liu et al. 2023). In particular, we found that a learned dynamics model offers several benefits when performing RLHF, including: (1) significant reductions in environment interactions when performing preference elicitation and policy optimization, (2) diverse preference queries can be synthesized safely and efficiently as a byproduct of standard model-based RL, and (3) reward pre-training based on suboptimal demonstrations can be performed without any environmental interaction. Our results provide empirical evidence that learned dynamics models enable robots to learn customized policies based on user preferences in ways that are safer and more sample efficient than prior preference learning approaches.
[Figure 4 omitted. See original PDF.]
Interactive imitation learning
It is important not to forget that active queries impose a burden on a human supervisor. To address this concern, we proposed a novel interactive imitation learning algorithm that reduces context switches by a human supervisor by only asking for help in critical states (states with either high novelty or high risk) (Hoque et al. 2021). Our approach yielded a synergistic improvement, resulting in 21% fewer human interventions, 57% higher human performance on side tasks, and 80% more robot throughput—having robots ask the right questions at the right times gives the robot more useful data but also limits supervisor burden by allowing the human to focus on other tasks when their help is not needed. Importantly, this enables a single human supervisor to simultaneously manage an entire fleet of robots with minimal cognitive workload.
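A gating rule in this spirit can be sketched as follows; the novelty and risk estimators and thresholds are simple placeholders (a real system would use learned density and risk models), but the decision logic is the point: ask for help only in critical states.

```python
import numpy as np

rng = np.random.default_rng(5)

# Fabricated training data: states the robot has already seen demonstrated.
train_states = rng.normal(0, 1, size=(200, 4))

def novelty(s):
    """Distance to the nearest training state (a simple 1-NN score)."""
    return np.linalg.norm(train_states - s, axis=1).min()

def risk(s):
    """Stand-in failure-probability estimate; a real system might use a
    learned Q-function or classifier instead of this toy rule."""
    return 1.0 / (1.0 + np.exp(-(np.abs(s).max() - 2.5)))

TAU_NOVEL, TAU_RISK = 1.5, 0.5    # placeholder thresholds

def should_ask_for_help(s):
    """Request an intervention only in critical states: novel OR risky."""
    return novelty(s) > TAU_NOVEL or risk(s) > TAU_RISK

in_dist = 0.1 * rng.normal(size=4)           # near the training data
out_dist = np.array([4.0, 4.0, -4.0, 4.0])   # far from anything seen
print(should_ask_for_help(in_dist), should_ask_for_help(out_dist))
```

Because the robot stays autonomous in familiar, low-risk states, the supervisor's attention is spent only where it matters, which is what makes one-human-to-many-robots supervision feasible.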
We have started extending this work to surgical robot learning (Thompson et al. 2025), where measuring novelty and risk is critical. However, due to variability in tissue geometries and stiffnesses, learning-based approaches for the autonomous manipulation of soft tissue often perform poorly, especially in out-of-distribution settings. We recently proposed the first application of uncertainty quantification to learned surgical soft-tissue manipulation policies as an early identification system for task failures (Figure 5). We validated our approach on a physical surgical robot performing soft-tissue manipulation. By endowing surgical robots with uncertainty quantification, they can detect out-of-distribution states that might lead to task failure and actively request human intervention when necessary. Our learned tissue manipulation policy with uncertainty-based early failure detection achieves a zero-shot sim2real performance improvement of 47.5% over the prior state of the art in learned soft-tissue manipulation and generalizes to new types of tissue as well as more complex bimanual surgical tasks (Thompson et al. 2025).
[Figure 5 omitted. See original PDF.]
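Ensemble disagreement is one common way to operationalize this kind of uncertainty quantification; the sketch below, with an invented linear policy class and fabricated data, flags out-of-distribution states where bootstrapped ensemble members diverge. The estimator used in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(6)

# Fabricated demonstration data: states (e.g., tissue features) and actions.
X = rng.normal(size=(300, 5))
true_w = np.array([0.5, -1.0, 0.3, 0.0, 0.8])
A = X @ true_w + 0.05 * rng.normal(size=300)

# Train K linear "policies" on bootstrap resamples of the same data.
K = 5
ensemble = []
for _ in range(K):
    idx = rng.integers(0, 300, size=300)          # bootstrap resample
    w_k, *_ = np.linalg.lstsq(X[idx], A[idx], rcond=None)
    ensemble.append(w_k)

def uncertainty(s):
    """Std of the ensemble's predicted actions: members agree near the
    training distribution and diverge on out-of-distribution states."""
    return float(np.std([w_k @ s for w_k in ensemble]))

s_in = rng.normal(size=5)                # in-distribution state
s_out = 50 * rng.normal(size=5)          # far outside the training support
print(f"in-dist: {uncertainty(s_in):.4f}  out-of-dist: {uncertainty(s_out):.4f}")

THRESHOLD = 0.05                          # would be tuned on held-out rollouts
print("request human intervention:", uncertainty(s_out) > THRESHOLD)
```

When the uncertainty score crosses the threshold, the policy pauses and hands control back to the surgeon, turning a would-be task failure into a brief intervention.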
CONCLUSION
As AI systems become commonplace in our daily lives, it is crucial that these systems are robust to the complexity and ambiguities that abound in the real world. I hope that you will join in the grand challenge of creating intelligent agents that can safely and seamlessly learn from and adapt to humans. To accomplish this goal, we need to continue to develop improved AI systems that can manage uncertainty over a wide range of sources, can efficiently fuse multiple forms of human feedback, and enable efficient verification and interpretability.
I lead the Aligned, Robust, and Interactive Autonomy (ARIA) Lab at the University of Utah, where we are actively developing the next generation of algorithms for aligning robot representations with humans, robust and reliable reinforcement learning from human feedback, surgical robot automation, intelligent exoskeletons, and algorithms for human-machine teaming that allow AI systems to learn from multiple humans and that allow a single human to interact with and teach large teams of robots. We are especially excited about applying our research to a wide range of challenging safety-critical applications, including domestic service, surgical, and assistive robots, and personalized digital assistants and tutors. In pursuit of this goal, we recently presented the first application of reinforcement learning from human feedback for robot surgery (Karimi et al. 2024) and are working to extend our results on uncertainty quantification and active learning to a variety of complex tasks, including shared autonomy (Belsare et al. 2025), human-multi-robot interaction (Mattson and Brown 2023), assistive exoskeletons (Thompson, Zhang, and Brown 2023), post-traumatic rehab, and large language models.
Successfully and efficiently integrating human input into the study of robust AI and robotics will not only require extending existing learning techniques but will also require developing new theoretical and algorithmic techniques that will benefit from insights from fields such as human factors, causal inference, cognitive science, robust control, and formal verification. I am excited by the many opportunities for collaboration between traditionally disparate fields as we work toward the goal of building human-centered AI systems that achieve safe, reliable, and beneficial outcomes for society.
ACKNOWLEDGMENTS
Daniel S. Brown leads the Aligned, Robust, and Interactive Autonomy (ARIA) Lab at the University of Utah. ARIA Lab research is supported in part by the NSF (IIS-2310759, IIS2416761), the NIH (R21EB035378), ARPA-H, Open Philanthropy, and the ARL STRONG program.
CONFLICT OF INTEREST STATEMENT
The author declares that there is no conflict of interest.
REFERENCES
Abbeel, Pieter, and Andrew Y. Ng. 2004. “Apprenticeship Learning via Inverse Reinforcement Learning.” In Proceedings of the Twenty‐First International Conference on Machine Learning (ICML), 1–8. Banff, Alberta, Canada: ACM.
Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. “Concrete Problems in AI Safety.” arXiv preprint arXiv:1606.06565.
Belsare, Atharv, Zohre Karimi, Connor Mattson, and Daniel S. Brown. 2025. “Toward Zero‐Shot User Intent Recognition in Shared Autonomy.” In International Conference on Human‐Robot Interaction (HRI).
Brown, Daniel S., and Scott Niekum. 2017. “Toward Probabilistic Safety Bounds for Robot Learning From Demonstration.” In AAAI Fall Symposium on AI for HRI, AAAI.
Brown, Daniel S., and Scott Niekum. 2018. “Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning.” In Proceedings of the AAAI Conference on Artificial Intelligence, 5234–5241. New Orleans, Louisiana: AAAI.
Brown, Daniel S., Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. 2019a. “Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning From Observations.” In International Conference on Machine Learning (ICML), 1109–1118. Long Beach, California: PMLR.
Brown, Daniel S., Wonjoon Goo, and Scott Niekum. 2019b. “Better‐Than‐Demonstrator Imitation Learning via Automatically‐Ranked Demonstrations.” In Proceedings of the 3rd Conference on Robot Learning (CoRL), 330–349. Zurich, Switzerland: PMLR.
Brown, Daniel S., Russell Coleman, Ravi Srinivasan, and Scott Niekum. 2020a. “Safe Imitation Learning via Fast Bayesian Reward Inference From Preferences.” In International Conference on Machine Learning (ICML), 2142–2151. Vienna, Austria: PMLR.
Brown, Daniel S., Scott Niekum, and Marek Petrik. 2020b. “Bayesian Robust Optimization for Imitation Learning.” In Neural Information Processing Systems (NeurIPS), 2962–2972. Vancouver, Canada.
Brown, Daniel S., Jordan Schneider, Anca D. Dragan, and Scott Niekum. 2021. “Value Alignment Verification.” In Proceedings of the 38th International Conference on Machine Learning (ICML), 1460–1470. PMLR.
Chi, Cheng, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. 2023. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” In Proceedings of Robotics: Science and Systems (RSS), RSS Proceedings.
Christian, Brian. 2021. The Alignment Problem: How Can Machines Learn Human Values? Atlantic Books.
Florence, Pete, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. 2021. “Implicit Behavioral Cloning.” In Conference on Robot Learning (CoRL), 1015–1030. London, UK: PMLR.
He, Jerry Zhi‐Yang, Daniel S. Brown, Zackory Erickson, and Anca Dragan. 2023. “Quantifying Assistive Robustness via the Natural‐Adversarial Frontier.” In Conference on Robot Learning (CoRL), 1865–1886. PMLR.
Hoque, Ryan, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S. Brown, and Ken Goldberg. 2021. “Thriftydagger: Budget‐Aware Novelty and Risk Gating for Interactive Imitation Learning.” In 5th Annual Conference on Robot Learning (CoRL), 1090–1101. London, UK: PMLR.
Javed, Zaynah, Daniel S. Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca D. Dragan, and Ken Goldberg. 2021. “Policy Gradient Bayesian Robust Optimization.” In International Conference on Machine Learning (ICML), 4870–4880. PMLR.
Kalra, Akansha, and Daniel S. Brown. 2024. “Can Differentiable Decision Trees Enable Interpretable Reward Learning From Human Feedback?” In Reinforcement Learning Conference (RLC).
Karimi, Zohre, Shing‐Hei Ho, Bao Thach, Alan Kuntz, and Daniel S. Brown. 2024. “Reward Learning From Suboptimal Demonstrations With Applications in Surgical Electrocautery.” In 2024 International Symposium on Medical Robotics (ISMR), 1–7. IEEE.
Lee, Seungjae, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. 2024. “Behavior Generation With Latent Actions.” In Proceedings of the 41st International Conference on Machine Learning (ICML).
Levine, Sergey, Aviral Kumar, George Tucker, and Justin Fu. 2020. “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.” arXiv preprint arXiv:2005.01643.
Liu, Yi, Gaurav Datta, Ellen Novoseller, and Daniel S. Brown. 2023. “Efficient Preference‐Based Reinforcement Learning Using Learned Dynamics Models.” In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2921–2928. IEEE.
Mattson, Connor, and Daniel S. Brown. 2023. “Leveraging Human Feedback to Evolve and Discover Novel Emergent Behaviors in Robot Swarms.” In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 56–64. Lisbon, Portugal: ACM.
Papadimitriou, Dimitris, and Daniel S. Brown. 2024. “Bayesian Constraint Inference From User Demonstrations Based on Margin‐Respecting Preference Models.” In 2024 IEEE International Conference on Robotics and Automation (ICRA), 15039–15046. IEEE.
Papadimitriou, Dimitris, Usman Anwar, and Daniel S. Brown. 2022. “Bayesian Methods for Constraint Inference in Reinforcement Learning.” Transactions on Machine Learning Research (TMLR).
Patil, Basavasagar, Akansha Kalra, Guanhong Tao, and Daniel S. Brown. 2025. “How Vulnerable Is My Policy? Adversarial Attacks on Modern Behavior Cloning Policies.” arXiv preprint arXiv:2502.03698.
Ramachandran, Deepak, and Eyal Amir. 2007. “Bayesian Inverse Reinforcement Learning.” In Proceedings of the 20th International Joint Conference on Artificial Intelligence, vol. 7, 2586–2591. Hyderabad, India: IJCAI Organization.
Russell, Stuart. 2022. “If We Succeed.” Daedalus 151(2): 43–57.
Shin, Daniel, Anca Dragan, and Daniel S. Brown. 2023. “Benchmarks and Algorithms for Offline Preference‐Based Reward Learning.” Transactions on Machine Learning Research (TMLR).
Syed, Umar, and Robert E. Schapire. 2007. “A Game‐Theoretic Approach to Apprenticeship Learning.” In Advances in Neural Information Processing Systems (NeurIPS), 1449–1456. Vancouver, Canada: NeurIPS.
Thompson, Jordan, Haohan Zhang, and Daniel S. Brown. 2023. “Towards a Gaze‐Driven Assistive Neck Exoskeleton via Virtual Reality Data Collection.” In HRI Workshop on Virtual, Augmented, and Mixed‐Reality for Human‐Robot Interactions.
Thompson, Jordan, Ronald Koe, Anthony Le, Gabriella Goodman, Daniel S. Brown, and Alan Kuntz. 2025. “Early Failure Detection in Autonomous Surgical Soft‐Tissue Manipulation via Uncertainty Quantification.” RSS Workshop on Out‐of‐Distribution Generalization in Robotics.
Tien, Jeremy, Jerry Zhi‐Yang He, Zackory Erickson, Anca Dragan, and Daniel S. Brown. 2023. “Causal Confusion and Reward Misidentification in Preference‐Based Reward Learning.” In The Eleventh International Conference on Learning Representations (ICLR).
Trinh, Tu, Haoyu Chen, and Daniel S. Brown. 2024. “Autonomous Assessment of Demonstration Sufficiency via Bayesian Inverse Reinforcement Learning.” In Proceedings of the 2024 ACM/IEEE International Conference on Human‐Robot Interaction (HRI), 725–733. Arlington, VA: ACM/IEEE.
© 2025. This work is published under the Creative Commons Attribution‐NonCommercial 4.0 License (http://creativecommons.org/licenses/by-nc/4.0/).
Abstract
Ensuring that AI systems do what we, as humans, actually want them to do is one of the biggest open research challenges in AI alignment and safety. My research seeks to directly address this challenge by enabling AI systems to interact with humans to learn aligned and robust behaviors. The way robots and other AI systems behave is often the result of optimizing a reward function. However, manually designing good reward functions is highly challenging and error‐prone, even for domain experts. Although reward functions are often difficult to manually specify, human feedback in the form of demonstrations or preferences is often much easier to obtain but can be difficult to interpret due to ambiguity and noise. Thus, it is critical that AI systems take into account epistemic uncertainty over the human's true intent. As part of the AAAI New Faculty Highlight Program, I will give an overview of my research progress along the following fundamental research areas: (1) efficiently quantifying uncertainty over human intent, (2) directly optimizing behavior to be robust to uncertainty over human intent, and (3) actively querying for additional human input to reduce uncertainty over human intent.