
Abstract

This study presents a method based on active preference learning to overcome the challenges of designing reward functions for autonomous navigation. Policies trained with artificially designed reward functions may not accurately reflect human intentions. We focus on the limitations of traditional reward functions, which often fail to facilitate complex tasks in continuous state spaces. We propose active preference learning to resolve these issues and to generate reward functions that align with human preferences. This approach leverages an individual’s subjective preferences to guide the agent’s learning process, enabling the creation of reward functions that reflect human desires. We use mutual information to generate informative queries and use the resulting information gain to balance the agent’s uncertainty against the human’s response capacity, encouraging the agent to pose straightforward yet informative questions. We further employ the No-U-Turn Sampler (NUTS) to refine the belief model, which outperforms the model constructed with the Metropolis algorithm. We then retrain the agent using the reward weights derived from active preference learning. As a result, our autonomous vehicle can navigate between random start and end points without depending on high-precision maps or routing, relying solely on forward vision. We validate the approach in the CARLA simulation environment. Our algorithm raises the success rate of autonomous driving navigation tasks that previously failed under artificially designed rewards to approximately 60%. Experimental results show a significant improvement over the baseline algorithm, providing a solid foundation for enhancing navigation capabilities in autonomous driving systems and advancing autonomous driving intelligence.
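
The query-selection and belief-refinement steps described in the abstract can be illustrated with a short sketch. The code below is a minimal illustration, not the authors' implementation: it assumes a linear reward over trajectory features with a Bradley-Terry (logistic) preference likelihood, uses NumPyro's NUTS sampler as the belief-refinement step, and scores candidate queries by the mutual information between the human's answer and the reward weights, estimated from posterior samples. All names, dimensions, and data in the example are illustrative.

# Minimal sketch (not the authors' code) of active preference learning with
# mutual-information query selection and a NUTS-refined belief over reward weights.
import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

D = 4  # illustrative number of reward features per trajectory


def preference_model(phi_a, phi_b, answers=None):
    # Bradley-Terry likelihood: P(human prefers A) = sigmoid(w . (phi_a - phi_b)).
    w = numpyro.sample("w", dist.Normal(jnp.zeros(D), 1.0).to_event(1))
    logits = (phi_a - phi_b) @ w
    numpyro.sample("obs", dist.Bernoulli(logits=logits), obs=answers)


def sample_belief(phi_a, phi_b, answers, rng_key, num_samples=1000):
    # Refine the belief over reward weights with NUTS, given past answers.
    mcmc = MCMC(NUTS(preference_model), num_warmup=500, num_samples=num_samples)
    mcmc.run(rng_key, phi_a, phi_b, answers)
    return mcmc.get_samples()["w"]  # shape (num_samples, D)


def binary_entropy(p):
    p = jnp.clip(p, 1e-6, 1 - 1e-6)
    return -(p * jnp.log(p) + (1 - p) * jnp.log(1 - p))


def mutual_information(w_samples, phi_a, phi_b):
    # I(answer; w) for one candidate query, estimated from posterior samples:
    # entropy of the mean predictive minus the mean of per-sample entropies.
    p = jax.nn.sigmoid(w_samples @ (phi_a - phi_b))  # P(prefer A) per sample
    return binary_entropy(p.mean()) - binary_entropy(p).mean()


# Illustrative usage: pick the most informative query from random candidates.
past_a = jax.random.normal(jax.random.PRNGKey(0), (10, D))  # features of option A
past_b = jax.random.normal(jax.random.PRNGKey(1), (10, D))  # features of option B
past_answers = jnp.ones(10)                                 # pretend the human chose A
w_post = sample_belief(past_a, past_b, past_answers, jax.random.PRNGKey(2))

candidates = [(jax.random.normal(jax.random.PRNGKey(i), (D,)),
               jax.random.normal(jax.random.PRNGKey(i + 100), (D,)))
              for i in range(20)]
scores = [mutual_information(w_post, a, b) for a, b in candidates]
best_query = candidates[int(jnp.argmax(jnp.array(scores)))]

In a pipeline like the one the abstract describes, the selected query would be shown to the human, the answer appended to the preference data, and the posterior re-sampled before the next query; a point estimate of the weights (for example, the posterior mean) could then serve as the reward for retraining the agent.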

Details

Title
Designing Reward Functions Using Active Preference Learning for Reinforcement Learning in Autonomous Driving Navigation
Author
Ge, Lun 1; Zhou, Xiaoguang 1; Li, Yongqiang 2

1 School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing 100876, China; [email protected]
2 Mogo Auto Intelligence and Telematics Information Technology Co., Ltd., Beijing 100010, China; [email protected]
First page
4845
Publication year
2024
Publication date
2024
Publisher
MDPI AG
e-ISSN
2076-3417
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3067412729
Copyright
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.