Hybrid Supervised and Reinforcement Learning for

Full text

Turn on search term navigation

1. Introduction

With the rapid advancement of autonomous driving (AD) technology, autonomous vehicles are expected to deliver safer and more comfortable riding experiences [1,2]. AD systems typically consist of several core modules, including sensing [3], perception [4], decision [5], planning [6], and control [7]. Among them, the controller module serves as the core execution unit of the system and is responsible for translating decision-making plans into physical vehicle actions. Developing a real-time, accurate, stable, and comfortable tracking control system is critical for ensuring an exceptional passenger experience, yet achieving this goal remains challenging.

In path tracking, the lateral control of the tracking system should minimize tracking errors while restricting computational load to guarantee real-time performance and robustness. Furthermore, the system should be designed to alleviate potential Motion Sickness (MS) [8] experienced by passengers engaged in non-driving activities. MS is primarily attributed to sensory conflicts between visual and vestibular systems [9], and its severity can be quantified by the Motion Sickness Dose Value (MSDV) model [10]. Recent studies demonstrate that non-driving-related activities exacerbate MS by reducing passengers’ visual coupling with vehicle motion [11,12]. Consequently, the trade-off between precise, stable tracking control and reducing passengers’ MS remains an unresolved issue requiring urgent attention in autonomous vehicle development.

Researchers have developed various methods to address path-tracking challenges [13]. Proportional–Integral–Derivative (PID) controllers [14,15] remain popular due to their simple formulation and ease of maintenance. However, PID controllers are highly sensitive to direct feedback errors and require parameter retuning to adapt to varying scenarios, such as changing road conditions or complex path-tracking tasks. This necessitates repetitive gain scheduling, making PID controllers costly and time-consuming to calibrate and test.

Pure Pursuit (PP) controllers [16], based on geometric models, are widely adopted in real-world autonomous systems due to their structural simplicity. However, PP controllers lack predictive capability for future state variations, leading to trajectory oscillations in high-speed or high-curvature tracking scenarios. Their performance also depends heavily on appropriate look-ahead distance settings. To address this, more sophisticated algorithms (e.g., the Stanley method) [17] dynamically adjust look-ahead distances based on vehicle speed. However, such approaches fall under feedforward control and struggle to handle complex driving conditions requiring real-time vehicle feedback.

Other studies employ Linear Quadratic Regulator (LQR) methods [18] for path tracking, but their reliance on linearized vehicle models introduces significant simplifications, potentially compromising accuracy. Model predictive control (MPC) [19,20,21], which generates optimal control commands through iterative prediction horizons, offers improved performance. However, MPC’s precision is also affected by model simplifications, and its computational cost remains high due to the need to solve optimization problems iteratively at each time step, posing challenges for meeting real-time tracking requirements.

While recent advances show the integration of MS mitigation strategies in controller design for AD systems (e.g., [22]), most existing approaches still rely heavily on simplified linear models. This focus on model simplicity often overlooks the complex dynamics associated with MS effects, which can lead to suboptimal tracking performance and reduced passenger comfort. The lack of systematic frameworks to address MS mitigation remains a critical challenge to fully realize the potential of AD systems.

To address the inherent limitations of conventional controllers in adaptability and their failure to mitigate MS degradation, this study proposes a Hybrid Supervised– Reinforcement Learning (HSRL) framework. The framework starts with expert demonstration collection, followed by supervised pre-training of the initial policy through Behavior Cloning (BC). This phase ensures accelerated policy convergence and stabilized training dynamics by aligning the policy distribution with expert priors. Subsequently, an online Reinforcement Learning (RL) paradigm fine-tunes the policy via environmental interaction, enabling adaptation to stochastic scenarios while enhancing robustness and generalization.

The main contributions of this work can be summarized as follows:

(1) Dual-Stage Learning Mechanism: This study introduces a supervised pre-training-RL dual-stage optimization mechanism, integrating expert knowledge-guided supervised pre-training with the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. During the supervised pre-training phase, BC is employed to optimize the controllability of initial policy gradient update directions, effectively mitigating the inherent sample efficiency limitations of RL algorithms. This approach significantly reduces the resource consumption and training time required for model development. In the online RL phase, the pre-trained policy is further refined through iterative interaction, which enhances the model’s performance ceiling and generalization capability across diverse scenarios.

(2) MS Metric Integration: In this study, the reward mechanism of the RL algorithm innovatively integrates the standardized MSDV. Compared with traditional methods that rely solely on acceleration or vehicle jerk as evaluation metrics, this improvement significantly enhances the model’s optimization effectiveness in reducing passengers’ MS experience. By incorporating the MSDV as a multidimensional evaluation parameter into the reward function, the system can more precisely quantify MS-inducing factors in dynamic ride environments.

(3) Performance Evaluation: The effectiveness of the HSRL is validated through several simulation experiments, where the HSRL method is compared with PID and MPC controllers. The results demonstrate that the HSRL proposed in this study significantly outperforms other controller methods in tracking performance while reducing the MSDV by 15.7%.

The structure of this paper is as follows: Section 2 discusses and summarizes related work. Section 3 defines the problem and introduces our overall framework along with the HSRL method. Section 4 describes the experimental setup and compares the results of different methods. Section 5 concludes our work and presents future directions.

2. Related Works

2.1. Geometric Controllers

Geometric controllers, as classical methods in the field of autonomous vehicle path tracking, rely on vehicle kinematic models to establish analytical geometric constraints. Their core principle revolves around simplifying multi-wheel systems into equivalent single-track models through Ackermann steering configurations, generating steering control inputs via preview mechanisms or error feedback mechanisms [23]. Representative algorithms such as PP and the Stanley method achieve path tracking through arc fitting of preview points or lateral error compensation. Notably, Yunxiao Shan et al. [24] introduced an enhanced PP algorithm, substituting the conventional circular fitting approach with a clothoid curve to achieve greater path-fitting precision. Additionally, a fuzzy logic controller was implemented to dynamically optimize the look-ahead distance—a critical parameter dominantly affecting the PP algorithm’s control performance. Meanwhile, the gain parameters of the Stanley method can be optimized through neural dynamic programming for data-driven refinement [25].

Although geometric controllers are widely adopted in low-speed scenarios due to computational efficiency and parameter interpretability, their kinematic assumptions lead to significant dynamic coupling effects in high-speed conditions. And rigid geometric constraints tend to cause feedback lag and curvature discontinuities. More critically, this control strategy exclusively focuses on precise geometric path tracking without proactive adaptation to human perceptual dynamics, where operations like abrupt acceleration and frequent steering may exacerbate passengers’ MS symptoms [12].

2.2. Model-Free Controllers

Model-free controllers operate on the core principle of generating control inputs through direct mapping between feedback signals and preset control laws. As a classic model-free controller, PID dynamically adjusts steering angle outputs through error proportional, integral, and derivative terms. Their structural simplicity and real-time performance have enabled widespread adoption in autonomous vehicle steering control. For instance, Wael Farag [26] proposed a unique trial-and-error-based technique that employs the Cross-Track Error (CTE) as the sole input to the controller, with the output being the steering command. Park et al. [27] implemented high-precision steering actuator tracking using a PID architecture with dead-zone compensation, while Amer et al. [14] enhanced path tracking robustness through dual-loop PID design. Current research focuses on hybrid paradigms combining data-driven approaches with model-assisted techniques [28], such as introducing fuzzy logic to adaptively adjust PID gains or constructing dynamic optimization strategies for parameters based on RL. These innovations aim to enhance environmental adaptability while preserving the engineering convenience of model-free controllers, demonstrating potential advantages in low-speed scenarios with high uncertainty.

Despite achieving satisfactory performance through parameter fine-tuning, the PID method still faces inherent limitations due to the absence of feedforward compensation mechanisms for vehicle dynamic coupling and time-varying disturbances. Its parameter tuning heavily relies on empirical trial-and-error methods, while fixed gains struggle to adapt to nonlinear characteristics under complex operating conditions, such as sudden road adhesion changes and load transfers. Moreover, model-free controllers fundamentally lack active modeling capabilities for correlating human vestibular perception with vehicle motion dynamics [29]. They cannot suppress occupant sensory conflicts through dynamic adjustments of acceleration profiles or steering smoothness parameters, resulting in elevated MS risks induced by abrupt acceleration/deceleration or high-frequency lateral movements.

2.3. Model-Based Controllers

Model-based control methods construct control laws through the deep coupling of vehicle dynamic characteristics. The core principle is to utilize system state equations and optimal control theory for feedforward–feedback collaborative optimization [30]. Typical paradigms include MPC [31] and LQR. MPC generates control sequences that satisfy multiple constraint conditions by optimizing over a rolling time horizon. Jacob et al. reduced the computation time to 2 ms by using custom C code [32], significantly enhancing engineering feasibility. Merabti et al. [33] deployed Beal’s optimization method to solve the nonlinear problem of a vehicle. LQR, on the other hand, solves for the optimal gain analytically based on a quadratic cost function but requires path pre-sampling to compensate for its lack of foresight. The feedforward-feedback composite architecture proposed by Shladover et al. [34] innovatively combines path curvature feedforward with differential feedback of deviations, while the lateral offset tangent condition introduced by Kapania et al. [35] further enhances tracking robustness in extreme conditions. Current research focuses on simplifying model order reduction, data-driven model calibration, and hybrid control strategy design to balance theoretical rigor with engineering applicability.

Although these methods perform noticeably in terms of accuracy, their performance is constrained by model mismatches and computational complexity. And such methods cannot actively suppress sensory conflicts caused by sudden longitudinal acceleration or low-frequency lateral oscillations.

3. Methodology

3.1. Problem Definition

This study aims to resolve control challenges in autonomous driving by enabling real-time and seamless path tracking within traffic environments. The proposed control framework necessitates simultaneous minimization of trajectory deviations and mitigation of passengers’ MS perception. At each control update interval, the path-tracking system generates optimal control commands based on real-time vehicle motion states, waypoint coordinates, and road curvature parameters, thereby achieving a desired tracking performance that balances kinematic precision with passenger comfort requirements.

3.1.1. Input Representation

A state space representation is proposed for HSRL as $S = \{d x, d y, Δ φ, v, k\}$ , where each state variable is defined as follows:

$d x \in R$ and $d y \in R$ denote the directional deviations $(m)$ between the vehicle’s current position and reference path waypoint, as shown in Figure 1.

$Δ φ \in (- π, π] (r a d)$ quantifies the heading error, defined as the angular difference between the vehicle’s orientation and the target trajectory direction, as shown in Figure 1.

v represents the vehicle’s longitudinal velocity.

k denotes the curvature parameter $(m^{- 1})$ of the predefined reference path. By incorporating this parameter and explicitly quantifying the geometric characteristics of the path, the path-tracking model significantly enhances its adaptability to unstructured road geometries.

3.1.2. Output Representation

The action space design $A = \{a_{a c c}, δ_{s t e e r}\}$ proposed in this study achieves both action space simplification and implicit encoding of physical constraints via piecewise linear mapping of longitudinal acceleration command and direct control of steering wheel angles. Specifically, the longitudinal acceleration command $a_{a c c} \in [- a_{m a x}, a_{m a x}]$ is bounded by a threshold $a_{t h} = - 0.5 \cdot a_{m a x}$ . Through extensive trials and validations, setting the coefficient of $a_{t h}$ to $0.5$ yields the most stable system response; deviating from this value (either smaller or larger) tends to introduce oscillatory behavior or control jitter. This is implemented via mutually exclusive $brake / throttle$ piecewise linear functions, defined as Equation (1), which ensure smooth control transitions by avoiding abrupt switching between control modes. Meanwhile, $δ_{s t e e r} \in [- δ_{m a x}, δ_{m a x}]$ directly governs the steering angle.

(1) $brake = \{\begin{matrix} 2 (a_{m a x} + a_{a c c}) & if a_{a c c} \leq a_{t h} \\ 0 & otherwise \end{matrix}, throttle = \{\begin{matrix} 0 & if a_{a c c} \leq a_{t h} \\ \frac{2}{3} (a_{a c c} + a_{t h}) & otherwise \end{matrix}$

3.2. Method Framework

The HSRL framework is proposed in this section with a two-stage mechanism to balance path tracking accuracy with passenger MS mitigation, comprising (1) an offline supervised learning phase for initial policy and (2) an online RL phase for adaptive optimization. The architectural schematic of the proposed framework is illustrated in Figure 2.

During the offline supervised learning phase, an expert agent is employed to collect a trajectory dataset within the target environment, capturing expert demonstrations for policy initialization. Subsequently, the initial policy network is trained via supervised learning using BC, enabling the acquisition of static mapping relationships inherent in the expert strategies [36].

Following offline supervised learning, an Actor–Critic architecture is further refined using online interaction data from RL. By incorporating exploration noise and temporal difference learning to optimize long-term returns while preserving expert prior knowledge, this approach achieves a seamless transition from supervised learning to autonomous path tracking. The hybrid paradigm significantly enhances training efficiency and policy stability while substantially reducing temporal training overhead.

3.3. Offline Supervised Learning

During the supervised learning phase, the expert agent collects a dataset containing state–action pairs, where the state space $S$ comprises multi-dimensional continuous features, and the action space $A$ represents continuous control signals. The supervised learning process is illustrated in Figure 3.

3.3.1. Expert Demonstration

The expert agent performs tasks within the widely adopted Longest6 benchmark [37] in autonomous driving to collect datasets. In this study, Town04 was selected for agent data collection due to its road environment complexity. The expert agent is composed of an A* planner and two PID controllers (lateral and longitudinal) to generate high-quality datasets.

The A* planner [37] receives a discrete coordinate sequence from the Longest6 benchmark as input and generates a high-density trajectory sequence as output. By leveraging the topological structure of the CARLA map, it creates continuous road-compliant paths between sparse waypoints through a systematic search process. After obtaining the high-density trajectory sequence generated by the A* planner, the lateral PID controller calculates the lateral control error $e_{t}^{lat}$ based on the extracted position coordinates from the trajectory. The longitudinal PID controller computes the longitudinal control error $e_{t}^{lon}$ using the target speed (configured via speed settings) and the current vehicle speed. In this study, the speed configuration is adjustable within a range of 10–130 km/h. Specifically, the $e_{t}^{lon}$ and $e_{t}^{lat}$ are calculated as follows:

(2) $\begin{matrix} e_{t}^{lat} & = arccos (\frac{{\vec{ω}}_{t} \cdot {\vec{v}}_{t}}{∥ {\vec{ω}}_{t} ∥ ∥ {\vec{v}}_{t} ∥}) \end{matrix}$

(3) $\begin{matrix} e_{t}^{lon} & = v_{t}^{target} - v_{t} \end{matrix}$

Here, ${\vec{ω}}_{t}$ represents the vector from the current position of vehicle to its next closest waypoint, which is derived from the high-density trajectory sequence generated by the A* planner. ${\vec{v}}_{t}$ , $v_{t}^{target}$ , and $v_{t}$ represent the velocity vector, the desired velocity, and current velocity of vehicle, respectively. Then, the lateral and longitudinal control output $δ_{s t e e r}$ and $a_{a c c}$ are computed separately using PID controllers. Then, the control outputs were saved as our action space $A = \{a_{a c c}, δ_{s t e e r}\}$ .

(4) $\begin{matrix} δ_{s t e e r} & = K_{P}^{lat} \cdot e_{t}^{lat} + K_{I}^{lat} \cdot \int_{0}^{t} e_{x}^{lat} d x + K_{D}^{lat} \cdot \frac{d}{d t} e_{t}^{lat} \end{matrix}$

(5) $\begin{matrix} a_{a c c} & = K_{P}^{lon} \cdot e_{t}^{lon} + K_{I}^{lon} \cdot \int_{0}^{t} e_{x}^{lon} d x + K_{D}^{lon} \cdot \frac{d}{d t} e_{t}^{lon} \end{matrix}$

where

K_{P}^{lon}

K_{I}^{lon}

K_{D}^{lon}

and

K_{P}^{lat}

K_{I}^{lat}

K_{D}^{lat}

represent the coefficients of the proportional, integral, and derivative terms in the lateral and longitudinal control, respectively.

To generate training data, the expert agent performs autonomous driving tasks in sequence along four routes provided by the Longest6 benchmark in Town04. During these tasks, required state space data $S = \{d x, d y, Δ φ, v, k\}$ is collected along the routes and combined with the action space $A = \{a_{a c c}, δ_{s t e e r}\}$ to form a state–action pair dataset.

The dataset containing state–action pairs is then randomly split into training and validation sets with a 4:1 ratio. The training set is used for policy network optimization via gradient descent, while the validation set monitors model generalization performance to prevent overfitting. To ensure consistent evaluation, the dataset partitioning employs stratified random sampling, which preserves the distribution characteristics of both states and actions.

3.3.2. Loss Function

The optimization objective of BC is to minimize the discrepancy between policy network outputs and expert actions. This study employs Mean Squared Error (MSE) as the loss function, defined as

(6) $L (ϕ) = \frac{1}{N} \sum_{i = 1}^{N} {∥ π_{ϕ} (s_{i}) - a_{i} ∥}_{2}^{2}$

where N denotes the batch size,

π_{ϕ} (s_{i})

represents the action predicted by the initial policy network, and

a_{i}

corresponds to the expert action. The policy parameters are iteratively updated via gradient descent as follows:

(7) $ϕ = ϕ - α \nabla_{ϕ} L (ϕ)$

where

α

denotes learning rate, and

\nabla_{ϕ}

calculates the gradient with respect to policy parameters

ϕ

at the iteration t. The training process utilizes mini-batch gradient descent with a batch size of 64 and the Adam optimizer, initialized with a learning rate of

10^{- 3}

. After each training epoch, the validation loss is computed by averaging the MSE over all validation samples, providing a robust indicator of model convergence. This optimization process is achieved through backpropagation, ensuring the progressive convergence of the initial policy’s action predictions toward the expert demonstration distribution. In continuous action spaces, the MSE directly quantifies the Euclidean distance between predicted actions and expert actions, with its convexity and smoothness facilitating gradient-based optimization.

3.4. Online RL Framework

After completing offline supervised learning, the pre-trained policy is loaded, with two value networks initialized. Subsequently, the efficient TD3 algorithm [38] is adopted for policy updates. Specifically, it employs two independent critic networks and delayed updates for the actor network to mitigate issues such as Q-value overestimation and inherent, unstable policy updates. The architecture of the TD3-based algorithm in the proposed RL phase is illustrated in Figure 4.

3.4.1. Markov Decision Process

In this phase, a sequential modeling approach is adopted, where the agent continuously interacts with the environment to acquire historical state–action information and perform actions at the current time step. This process is formalized as a standard MDP [39], defined as a tuple $M = (S, A, P, r, γ)$ , where $S$ is the state space, $S \subseteq R^{n}$ , containing instantaneous information about the interaction between the agent and the environment. $A$ is the action space, $A \subseteq R^{m}$ , representing the set of actions the agent can perform. Each action a $\in A$ corresponds to a control command. $P$ is the state transition probability function, $P : S \times A \times S \to [0, 1]$ . r is the reward function, and $r : S \times A \to R$ . $γ \in [0, 1]$ is the discount factor.

In the MDP framework, the goal of RL is to find an optimal policy $π_{ϕ}$ that maximizes the cumulative reward. Specifically, the objective is to maximize the expected cumulative reward $J_{R} (π_{ϕ})$ , which is the total expected reward over time. The state space and action space definitions in this section follow the definitions established in the problem statement and are consistent with those used in the offline supervised learning phase.

3.4.2. Reward Function

To strike a balance between tracking performance and comfort, this section constructs a reward function architecture, encompassing components such as velocity, trajectory, heading, control, and MSDV rewards to support the optimization of the policy and accelerate convergence. The total reward is formulated as follows:

(8) $R_{t o t a l} = R_{v e l o c i t y} + R_{t r a j e c t o r y} + R_{h e a d i n g} + R_{c o n t r o l} + R_{M S D V}$

The velocity component $R_{v e l o c i t y}$ evaluates speed management by rewarding proximity to the target velocity:

(9) $R_{v e l o c i t y} = α_{1} \cdot (1 - \frac{v_{e g o} - v_{d e s}}{v_{m a x}})$

where

v_{e g o}, v_{d e s}

, and

v_{m a x}

represent the current velocity, desired velocity, and maximum permissible speed, respectively. This encourages appropriate speed adaptation throughout the driving task. The trajectory component

R_{t r a j e c t o r y}

evaluates the arrival of path points by measuring the displacement deviation in the x and y directions:

(10) $R_{t r a j e c t o r y} = - α_{2} \cdot (d x^{2} + d y^{2})$

The heading component $R_{h e a d i n g}$ evaluates directional alignment with the intended route:

(11) $R_{h e a d i n g} = - α_{3} \cdot δ_{a n g u l a r}$

where

δ_{a n g u l a r}

measures the absolute angular deviation between vehicle orientation and route direction. This ensures proper vehicle alignment along the planned path. The control component

R_{c o n t r o l}

incentivizes smooth steering inputs:

(12) $R_{c o n t r o l} = \{\begin{matrix} - β_{1} & if |δ_{s t e e r}^{t} - δ_{s t e e r}^{t - 1}| > δ_{s t e e r}^{t h r e s h o l d} \\ 0 & otherwise \end{matrix}$

where

δ_{s t e e r}^{t}

and

δ_{s t e e r}^{t - 1}

represent current and previous steering angles, with

δ_{s t e e r}^{t h r e s h o l d}

set at 0.01. This discourages abrupt steering adjustments that could compromise ride comfort.

The MSDV is a metric that measures the accumulation of MS over time, as outlined in the ISO 2631 standard [10]. This metric accounts for sickening stimuli by applying frequency-dependent weightings to acceleration across different frequency ranges; this is because MS is influenced by the frequency of motion to which individuals are exposed [40]. The MSDV is defined as Equation (14):

(13) $R_{M S D V} = - β_{2} \cdot M S D V$

(14) $M S D V = \sqrt{\int_{0}^{T} {[a_{x, w_{1}} (t)]}^{2} d t} + \sqrt{\int_{0}^{T} {[a_{y, w_{2}} (t)]}^{2} d t}$

where

a_{x, w_{1}} (t)

and

a_{y, w_{2}} (t)

are weighted accelerations in longitudinal and lateral directions in time domain;

d t

is the time increment, and T is the exposure time. The MS frequency weighting curve proposed in [41] was adopted as the baseline. Through multiple iterations of revision and experimental validation, the optimal average weights for

a_{x, w_{1}} (t)

and

a_{y, w_{2}} (t)

were ultimately determined as

0.6

and

0.4

, respectively. This parameter selection achieved dual objectives of minimizing weighting errors while maintaining computational simplicity in reward estimation.

3.4.3. RL Optimization Process

In the online RL framework, the HSRL achieves policy optimization through continuous interaction with the dynamic environment. Its core objective is to maximize long-term cumulative returns. A dual critic network (Double Q-Learning) is introduced, which updates its parameters using temporal difference (TD) method, along with the pre-trained policy network $π_{ϕ}$ . The two target critic networks calculate the value of the next state:

(15) $\{\begin{matrix} y_{1} = r + γ Q_{θ_{1}^{'}} (s_{t + 1}, {\tilde{a}}_{t + 1}) \\ y_{2} = r + γ Q_{θ_{2}^{'}} (s_{t + 1}, {\tilde{a}}_{t + 1}) \end{matrix}$

where the

{\tilde{a}}_{t + 1}

represents the action generated by the target policy network and perturbed with clipped noise. In addition, to address the trade-off between bias and variance, the calculation of the q-value should be smoothed to avoid overfitting. Therefore, truncated normal distribution noise is added to each action as regularization, making the modified target update as follows:

(16) ${\tilde{a}}_{t + 1} \leftarrow π_{ϕ^{'}} (s_{t + 1}) + ϵ, ϵ \sim c l i p (N (0, \tilde{σ}), - c, c)$

The minimum output value of the target network is selected as the target q-value to offset the overestimation problem of q-values. This is substituted into the Bellman Equation (19) to compute the TD-error and the loss function (18):

(17) $y = r_{t} + γ min_{i = 1, 2} Q_{θ_{i}^{'}} (s_{t + 1}, {\tilde{a}}_{t + 1})$

(18) $L = \frac{1}{N} \sum {(y - Q_{θ_{i}} (s_{t}, a_{t}))}^{2}$

(19) $Q_{θ_{i}^{'}} (s_{t}, a_{t}) = E [r (s_{t}, a_{t}) + γ Q_{θ_{i}^{'}} (s_{t}, {\tilde{a}}_{t})]$

However, observations with errors are prone to causing divergence. Therefore, to minimize error propagation, the policy network is designed to update at a lower frequency than the critic network. That is, after multiple updates of the critic network, the policy network adjusts its parameters through gradient backpropagation. The lower the update frequency of the policy network, the smaller the variance in the q-value function updates, leading to a higher-quality policy. The parameter updates of the policy network are achieved through deterministic policy gradients. The specific loss gradient formula is as follows:

(20) $\nabla_{ϕ} J \leftarrow \frac{1}{N} \sum \nabla_{a_{t}} Q_{θ_{1}} (s_{t}, a_{t}) |_{a_{t} \sim π_{ϕ} (s_{t})} \nabla_{ϕ} π_{ϕ} (s_{t})$

3.5. Path-Tracking Algorithm Based on HSRL

In this section, the path-tracking algorithm based on HSRL is described in detail, as shown in Algorithm 1. A two-stage framework is implemented to achieve stable and efficient tracking control. The algorithm first initializes the policy network $π_{ϕ}$ and collects dataset $D$ using the expert agent (lines 1–6). During the supervised learning phase, state–action pairs are sampled from expert dataset $D$ to minimize the MSE between the policy network’s outputs and the expert’s actions through BC, completing the initial imitation learning of the policy (lines 7–10).

Algorithm 1 Path tracking based on HSRL.

Input: trajectory length L, pretraining iterations I, episode timesteps T, learning rate $α$

Initialize env, expert agent, offline dataset $D$ , initialized policy $π_{ϕ}$ with random weights

Phase 1: Dataset acquisition and offline pretraining

1:. Sample the initial state $s_{1}$ from the env

2:. for each timestep l in L do

3:. $a_{l} \leftarrow expert agent.get_action(s_{l})$ ;

4:. Execute $a_{l}$ , observe reward $r_{l}$ and next state $s_{l + 1}$ ;

5:. Store transition tuple $(s_{l}, a_{l}, r_{l}, s_{l + 1})$ in offline dataset $D$ ;

6:. end for

7:. for each iteration i in I do

8:. Sample batch ${(s_{i}, a_{i})}$ from $D$

9:. Update $ϕ$ via supervised learning: $min \sum ∥ π_{ϕ} (s_{i}) - a_{i} ∥^{2}$

10:. end for

Phase 2: Online Reinforcement Learning

Initialize critic networks $Q_{θ_{1}}, Q_{θ_{2}}$ and target networks $θ_{1}^{'} \leftarrow θ_{1}$ , $θ_{2}^{'} \leftarrow θ_{2}$ , $ϕ^{'} \leftarrow ϕ$ with random weights, replay buffer $B$

11:. for each timestep t in T do

12:. Select action: $a_{t} \leftarrow π_{ϕ} (s_{t}) + ϵ$ , $ϵ \sim N (0, σ)$

13:. Execute $a_{t}$ , observe $r_{t}, s_{t + 1}$ , store $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in $B$

14:. Sample mini-batch of N transitions $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ from $B$

15:. Generate target action: ${\tilde{a}}_{t + 1} \leftarrow π_{ϕ^{'}} (s_{t + 1}) + ϵ$ , $ϵ \sim c l i p (N (0, \tilde{σ}), - c, c)$

16:. Compute target value: $y \leftarrow r_{t} + γ {min}_{i = 1, 2} Q_{θ_{i}^{'}} (s_{t + 1}, {\tilde{a}}_{t + 1})$

17:. Update critic: $θ_{i} \leftarrow arg {min}_{θ_{i}} \frac{1}{N} \sum {(y - Q_{θ_{i}} (s_{t}, a_{t}))}^{2}$

18:. if $t mod d = 0$ then

19:. Compute policy gradient:

20:. $\nabla_{ϕ} J \leftarrow \frac{1}{N} \sum \nabla_{a_{t}} Q_{θ_{1}} (s_{t}, a_{t}) |_{a_{t} \sim π_{ϕ} (s_{t})} \nabla_{ϕ} π_{ϕ} (s_{t})$

21:. Update actor: $ϕ \leftarrow ϕ + α \nabla_{ϕ} J$

22:. Soft update target networks:

23:. $θ_{i}^{'} \leftarrow τ θ_{i} + (1 - τ) θ_{i}^{'}$

24:. $ϕ^{'} \leftarrow τ ϕ + (1 - τ) ϕ^{'}$

25:. end if

26:. end for

In the online RL phase, the policy network generates action $a_{t}$ based on the current state, and after execution, collects the transition tuple $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in the experience replay buffer $B$ (lines 11–13). In the update phase, after sampling the mini-batch of data, the TD target value is calculated, where ${\tilde{a}}_{t + 1}$ is generated by the target policy network $π_{ϕ^{'}}$ , and clipped noise is added. The parameters of the dual critic networks are updated by minimizing the MSE between the predicted Q-value and the target value (lines 14–17). To suppress the overestimation of Q-values, every d steps, the policy network is updated through gradient backpropagation based on the critic network, and soft updates are performed on the target network (lines 18–25). This design combines the stability of supervised learning with the adaptability of online RL, effectively enhancing the robustness and convergence efficiency of the policy through the dual critic mechanism and noise smoothing.

4. Experiment

This section comprehensively outlines the experimental setup, selected baselines, assessment protocols, ablation study, hyperparameter sensitivity analysis of the reward function, and empirical results analysis, with the aim of validating the proposed method’s effectiveness and quantifying performance improvements across critical metrics.

4.1. Experimental Setup

4.1.1. Implementation Details

All experiments utilized a NVIDIA GeForce RTX 3090 GPU, and the HSRL model was implemented using the PyTorch 2.7.1 framework. All evaluations in this experiment were conducted within the CARLA 0.9.14 environment.

4.1.2. Training Dynamics Analysis

The training process consists of two sequential phases: supervised training via BC and RL. During the supervised training phase, expert demonstrations are collected using an expert agent to construct a dataset from which state–action pairs are extracted for initial policy cloning. The loss reduction curve of the BC procedure is illustrated in Figure 5b. Subsequently, in the RL phase, the agent interacts with the environment within training scenario represented in Figure 5a at speeds ranging from 10 to 120 km/h to acquire rewards for policy updates. The learning curve of this training process is shown in Figure 5c, where the cumulative reward converges after 135 episodes.

4.1.3. Baselines and Evaluation Metrics

Tests were conducted on publicly available predefined paths, where selected comparison methods were maintained under optimal operational conditions throughout testing. Methods include PID based on Wael Farag Tuning Method (WAF-Tune) [26] and MPC-based path tracking [31].

WAF-Tune-based PID represents a novel tuning scheme in the PID framework. It uses an ad hoc trial-and-error-based technique and uses only the CTE as an input to the controller, whereas the output is the steering command. The steering command is produced after applying proportional, integral, and differential control to it in terms of $K p$ , $K i$ , and $K d$ coefficients, respectively. Its main design effort is to carefully tune these three coefficients to obtain the best possible performance. The performance can be simply defined as letting the vehicle follow the predefined path as closely as possible with the lowest aggregated CTE throughout the entire trip. The main goal of the controller is to minimize the aggregated CTE (the objective function), as given by Equation (21). The PID controller was configured with a throttle input of 0.3 and the coefficients ( $K p, K i$ , and $K d$ ) set to (0.35, 0.0005, and 6.5), which were finalized using WAF-Tune method.

(21) $Objective Function = {min}_{i = 0}^{N} \frac{1}{N} \sum_{i = 0}^{N} C T E_{i}^{2}$

MPC-based path tracking [31] is a classical model-based controller. It employs a well-known bicycle model with four states $[x, y, θ, v]$ , as follows:

(22) $\{\begin{matrix} \dot{x} & = v cos (θ + β) \\ \dot{y} & = v sin (θ + β) \\ \dot{θ} & = \frac{v}{l_{r} + l_{f}} sin (β) \\ \dot{v} & = a \end{matrix}$

where

β = arctan (\frac{l_{r}}{l_{r} + l_{f}} tan δ)

denotes the slip angle, and

l_{r}

and

l_{f}

are the distances from rear and front axles to the center of vehicles, respectively.

θ

and a are the steering angle of front wheels and acceleration, respectively.

For a specified reference trajectory within the prediction horizon, the objective of the model predictive control system is to minimize the discrepancy between the predicted output and the desired path. This is achieved through an optimized cost function that enables the autonomous vehicle to track the target trajectory with both rapid response and smooth maneuvering. To ensure effective path tracking performance, the controller must simultaneously address system state deviations and optimize control outputs. The resulting objective function for the path-tracking controller is formulated as a quadratic cost function that incorporates both the system states and control inputs:

(23) $\begin{matrix} J (k) = \sum_{j = 1}^{N_{p}} & [{\tilde{χ}}^{T} (k + j) Q \tilde{χ} (k + j) + {\tilde{u}}^{T} (k + j - 1) \tilde{R} \tilde{u} (k + j - 1)] \\ s . t . & u_{min} (t + k) \leq u (t + k) \leq u_{max} (t + k), \\ Δ u_{min} (t + k) \leq Δ u (t + k) \leq Δ u_{max} (t + k), \\ y_{min} (t + k) \leq y (t + k) \leq y_{max} (t + k) . \end{matrix}$

where Q and R denote the weighting matrices for control actions and control increments, respectively; the index k adopts different ranges depending on the constraint type:

k = 0, 1, \dots, N_{c} - 1

for control and control increment constraints, whereas

k = 1, \dots, N_{p}

corresponds to output constraints. The first component in Equation (23) quantifies the system’s tracking performance capability, while the second term enforces regulation on the control sequence. A notable advantage of this objective function formulation is its inherent structure that facilitates direct transformation into a standard quadratic programming (QP) problem.

Moreover, the parameter settings for PID and MPC are consistent with those in the original study. The main parameters of the experiments are referenced in Table 1.

To validate the effectiveness of the model, the experiments evaluated the performance of HSRL in terms of tracking performance and reduction in MS, comparing it with PID and MPC. For path tracking, lateral deviation was selected as the primary evaluation metric, with the reason being that the lateral error directly determines whether the vehicle deviates from the lane or trajectory and serves as the primary factor affecting collision risks.

Jerking and the MSDV are key contributors to MS, where jerking dominates transient discomfort (e.g., sudden braking causing a “lurching sensation”), while the MSDV reflects the cumulative fatigue-inducing stimulation of the vestibular system due to prolonged vibrations (e.g., low-frequency body sway during continuous winding mountain roads).

4.2. Tracking Performance

The tracking and MS tests were conducted sequentially in S-shape, U-shape, and O-shape routes in Figure 6 at a speed of 35 km/h. The high-speed application is demonstrated later. The S-shape route features single-lane roads and T-shaped intersections, making it suitable for evaluating basic urban driving. In contrast, the O-shape route includes a sharp turn, which tests the vehicle’s steering performance.

Table 2 shows the statistical results of lateral deviation and jerking. It is observed that HSRL achieves consistently high average performance in both training and testing scenarios.

The path tracking comparison results are shown in Figure 7. In the simpler test scenario, the S-shape route Figure 6a, the error control performance of all three controllers is better than rest of test scenarios. Additionally, only the HSRL method was able to control the error within a relatively narrow range. With respect to low-error-range control, the PID and HSRL controllers exhibit similar effectiveness in precision tracking. In simple terms, the HSRL model outperformed the PID and MPC controllers in terms of tracking. The excellent tracking performance of the HSRL is due to the data-driven mechanism, which can directly learn the nonlinear dynamic characteristics of complex systems without relying on precise mathematical models, making it highly adaptable.

As presented in Table 3, the maximum lateral deviation error (ME), its occurrence timestep (t), and the corresponding road curvature (k) are detailed. For the PID controller, its fixed gain parameters struggle to adapt to rapidly changing tracking demands in highly dynamic scenarios. The integral term may accumulate significant errors, while the derivative term’s sensitivity to noise can cause oscillatory control outputs, ultimately leading to substantial error values. Although the MPC controller excels in prediction and optimization, its computational delay may become pronounced under complex scenarios. The HSRL controller’s training data may not have fully covered extreme or rare operating conditions, resulting in the maximum lateral deviation error. Nevertheless, it outperformed both PID and MPC.

4.3. Reduction in MS Performance

To measure the performance of three controllers in reducing passengers’ MS, jerking and the MSDV were considered simultaneously. The experimental results of jerking are shown in Figure 8, and the MSDV comparison results are shown in Figure 9.

With respect to jerks, MPC cannot achieve satisfactory performance in either maximum jerk control or low-jerk-range scenarios. PID and HSRL methods have similar performance, while the HSRL performs better in low-jerk-range scenarios.

In MSDV reduction, the PID and MPC methods have similar performance. Additionally, the HSRL method demonstrates a significant reduction in MSDV values compared to PID and MPC approaches. Based on data from three test scenarios, the HSRL method reduces the MSDV by approximately 15.7%, which is projected to effectively lower the incidence of passenger MS.

Furthermore, it is also observed that as the scenario becomes more complex from the S-shape route to O-shape, the final accumulated MSDV shows a progressive rise. Thanks to the HSRL supporting multi-objective joint optimization, which integrates the standardized MSDV into the reward mechanism, the proposed method performs excellently in reducing passengers’ MSDV.

The synergistic effects of jerking and the MSDV collectively determine passengers’ MS perception. While jerking quantifies transient discomfort caused by abrupt acceleration changes, the MSDV reflects the cumulative vibrational energy exposure over time—a critical factor for prolonged sickness onset. The experimental results demonstrate that HSRL’s ability to simultaneously suppress high-frequency jerks (reducing instantaneous discomfort) and minimize MSDV accumulation (mitigating long-term sickness risk) creates a complementary advantage. Specifically, smoother jerk transitions alleviate acute motion disturbances, whereas the 15.7% MSDV reduction indicates substantially lower cumulative motion energy transmission to passengers. Such comprehensive performance enhancement becomes particularly vital in complex driving scenarios where sharp, intermittent jerks and prolonged irregular vibrations synergistically exacerbate MS severity, highlighting the necessity of multi-objective control frameworks for holistic ride comfort improvement.

4.4. High-Speed Performance

The high-speed performance evaluation was conducted under a unified reference speed profile, as shown in Figure 10b, which required rapid acceleration to peak speeds followed by deceleration before sharp turns. The test scenario (Figure 10a) included multiple curve points to test path tracking robustness. The PID, MPC, and the proposed HSRL were compared based on lateral deviation (Figure 10c) and the MSDV over time (Figure 10d).

Regarding path tracking, PID exhibited the largest lateral deviations with pronounced fluctuations, while MPC showed moderate improvements but still struggled with sharp curves. In contrast, HSRL demonstrated the smallest overall deviations, with a tighter distribution and fewer extreme values. For the MSDV, PID and MPC showed gradual increases, but HSRL maintained the lowest cumulative values throughout the experiment. Notably, HSRL achieved superior path tracking stability while minimizing passenger discomfort, outperforming both PID and MPC in balancing high-speed dynamics and ride smoothness. The superior performance of HSRL in high-speed scenarios is attributed to its multi-objective optimization framework and inherent robustness, which enable precise path tracking while minimizing passenger discomfort.

4.5. Reward Hyperparameter Sensitivity Analysis

To investigate the impact of the reward weight hyperparameters on algorithm performance, a sensitivity analysis was conducted, as shown in Figure 11. The x-axis represents different parameter sets, with specific hyperparameter values detailed in Table 4. The y-axis indicates the corresponding reward scores for these parameter sets, representing the overall performance of the method. All subsequent experiments in this paper were implemented using the highest-scoring configuration, D.

Compared to D, configuration A increases the speed reward coefficient, forcing the vehicle to maintain high speed in curves and seriously compromising path tracking accuracy, ultimately leading to a decrease in total reward. Configuration B enhances the weight of control components, resulting in a decline in the reward. This is because overemphasizing steering smoothness causes delayed steering responses during sharp turns, increasing tracking errors. Configurations C and E, respectively, amplify the weights of tracking error and the MSDV, both resulting in reduced rewards. This occurs because overemphasizing either tracking error or the MSDV negatively impacts the other metric.

4.6. Ablation Study

To comprehensively evaluate HSRL’s performance across different components, this study designed a series of ablation experiments. In these experiments, specific components of HSRL were removed sequentially, and the resulting models were evaluated in three test scenarios shown in Figure 6.

To evaluate the overall performance of the resulting models, the test data from three scenarios were aggregated by calculating the mean, standard deviation, maximum value, and Total Motion Sickness Dose Value (TMSDV). The mean quantifies the central tendency of the system’s behavior, the standard deviation reflects the consistency across scenarios, the maximum value highlights potential performance limits, and the TMSDV serves as a quantitative metric for assessing the impact of vehicle dynamic control on passenger comfort. These aggregated results are summarized in Table 5.

In the experimental data, it can be observed that when the supervised learning objective is removed, the model HSRL w/o $L^{S L}$ exhibits significant performance degradation across all metrics compared to the full HSRL model. Specifically, the following were found:

Lateral deviation increases from 0.0925 m (mean) to 0.1468 m, with a 41.3% increase in standard deviation.

Jerk rises from 0.0202 m/ $s^{3}$ (mean) to 0.0315 m/ $s^{3}$ , accompanied by a 20.8% increase in standard deviation.

TMSDV increases from 475.7 to 681.3, with a 43.2% increase.

In contrast, the HSRL w/o $L^{R L}$ shows moderate degradation (e.g., TMSDV increases by 20.4%), indicating that RL contributes to adaptability but is less essential than supervised learning for core accuracy and comfort. This further confirms that supervised learning component is critical for foundational performance, while RL provides complementary adaptability.

5. Conclusions

This paper proposes an innovative HSRL framework, which employs a two-stage optimization mechanism. In the first stage, an initial policy is trained via BC to rapidly acquire fundamental knowledge. In the second stage, the model is refined through RL, enabling autonomous exploration to achieve superior performance. For RL training, HSRL leverages the TD3 algorithm, ensuring both operational efficiency and robust safety guarantees for the AD system. Validated through experiments across multiple test scenarios, the HSRL method demonstrates outstanding path-tracking performance and generalization capability while reducing the MSDV by 15.7% during AD operations.

Looking ahead to future research directions, the proposed HSRL framework still holds significant development potential. In MS dynamics, while this study quantifies MS via acceleration metrics, future investigations will integrate vehicle-specific dynamics, including suspension characteristics and body resonance frequencies, to establish a multi-physics discomfort model. Additionally, field tests under heterogeneous road conditions and sensor noise profiles are essential. A phased implementation framework will be designed to address latency tolerance and edge computing constraints in physical vehicle platforms.

Author Contributions

Author Contributions: Conceptualization, Y.L.; methodology, Y.C.; software, Z.C.; validation, Y.T.; formal analysis, Y.L.; investigation, Y.F.; resources, F.G.; writing—original draft preparation, Y.L.; writing—review and editing, R.Z.; visualization, R.Z. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable for studies not involving humans or animals.

Informed Consent Statement

This study does not involve humans.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Definition of heading error $(Δ φ),$ directional deviations $(d x, d y)$ , and wheelbase $(L_{w})$ .

Figure 2 The overall framework of the proposed two-stage training. It combines offline supervised learning and RL to enhance training stability.

Figure 3 Supervised learning for initial policy.

Figure 4 Architecture of TD3. The Q* represents the minimum output between the two target critic networks.

Figure 5 Training setup of the policy in the HSRL process. (a) W-shape route used for RL training. (b) Supervised learning curve. (c) RL curve.

Figure 6 Three test scenarios for tracking and MS performance validation at a speed of 35 km/h and blue dots mark the vehicle’s start and end locations. (a) S-shape. (b) U-shape. (c) O-shape.

Figure 7 The path tracking comparison results of three scenarios.

Figure 8 The jerking comparison results of three test scenarios.

Figure 9 MSDV comparison results in three different routes. (a), (b) and (c) show the cumulative values of MSDV over time for the vehicle in the S-shape, U-shape, and O-shape scenarios, respectively.

Figure 10 Experimental setup and results for high-speed tests. (a) Random shape. (b) Speed profile. (c) Lateral deviation results. (d) MSDV results.

Figure 11 Reward hyperparameter sensitivity analysis.

Table 1

Main parameters used in the experiments.

Parameters	Value
Supervised Learning
Learning rate	1 × $10^{- 3}$
Policy noise	0.1
Batch size	64
Optimizer	Adam
Reinforcement Learning
Optimizer	Adam
Policy frequency	2
Policy noise	0.2
Policy learning rate	3 × $10^{- 4}$
Discount factor $γ$	0.99
tau	0.005
Reward function weights $α_{1}, α_{2}, α_{3}, β_{1}, β_{2}$	1.5, 1.0, 2.5, 0.1, 1.6
PID based on WAF-Tune
35 km/h (Constant speed)–Lateral (Kp, Ki, Kd)	(0.35, 0.0005, 6.5)
High speed–Lateral (Kp, Ki, Kd)	(0.2, 0.0000, 7.0)
High speed–Longitudinal (Kp, Ki, Kd)	(0.3, 0.0500, 0.5)
MPC
Sample time (s)	0.05
Prediction horizon	20
Control horizon	5
Vehicle mass (kg)	1720
Front suspension stiffness (N/m)	35,000
Rear suspension stiffness (N/m)	30,000

Table 2

Lateral deviation and jerking on different routes.

Scenario	Lateral Deviation (Mean, Std, Max)	Jerk (Mean, Std, Max)
W-shape (training)	(0.0549, 0.0815, 0.4443)	(0.0194, 0.0551, 0.5834)
S-shape (test)	(0.0469, 0.0809, 0.7908)	(0.0189, 0.0571, 0.6745)
U-shape (test)	(0.1219, 0.1587, 0.8659)	(0.0210, 0.0559, 0.6009)
O-shape (test)	(0.1159, 0.1354, 0.7523)	(0.0204, 0.0602, 0.6714)

Table 3

The point of maximum error.

Method	S- Shape (ME, t, k)	U- Shape (ME, t, k)	O- Shape (ME, t, k)
PID	(1.47, 2743, 0.018)	(1.07, 2973, 0.014)	(1.23, 429, 0.027)
MPC	(0.98, 3208, 0.014)	(0.97, 2970, 0.013)	(0.84, 438, 0.025)
HSRL	(0.79, 3215, 0.015)	(0.87, 1988, 0.017)	(0.75, 429, 0.026)

Table 4

Hyperparameter combination.

Hyperparameter Combination	Value
A	$α_{1} = 2.0, α_{2} = 1.0, α_{3} = 2.5, β_{1} = 0.1, β_{2} = 1.6$
B	$α_{1} = 1.5, α_{2} = 1.0, α_{3} = 2.5, β_{1} = 0.2, β_{2} = 1.6$
C	$α_{1} = 1.5, α_{2} = 1.5, α_{3} = 3.0, β_{1} = 0.1, β_{2} = 1.6$
D	$α_{1} = 1.5, α_{2} = 1.0, α_{3} = 2.5, β_{1} = 0.1, β_{2} = 1.6$
E	$α_{1} = 1.5, α_{2} = 1.0, α_{3} = 2.5, β_{1} = 0.1, β_{2} = 2.0$

Table 5

Ablation study results on the objective terms.

Method	Lateral Deviation (Mean, Std, Max)	Jerk (Mean, Std, Max)	TMSDV
HSRL	(0.0925, 0.1185, 0.8659)	(0.0202, 0.0578, 0.6745)	475.7
HSRL w/o $L^{S L}$	(0.1468, 0.1676, 0.9817)	(0.0315, 0.0687, 0.9719)	681.3
HSRL w/o $L^{R L}$	(0.1381, 0.1538, 0.8732)	(0.0293, 0.0613, 0.89324)	572.8

References

1. Parekh, D.; Poddar, N.; Rajpurkar, A.; Chahal, M.; Kumar, N.; Joshi, G.P.; Cho, W. A review on autonomous vehicles: Progress, methods and challenges. Electronics; 2022; 11, 2162. [DOI: https://dx.doi.org/10.3390/electronics11142162]

2. Pettersson, I.; Karlsson, I.M. Setting the stage for autonomous cars: A pilot study of future autonomous driving experiences. IET Intell. Transp. Syst.; 2015; 9, pp. 694-701. [DOI: https://dx.doi.org/10.1049/iet-its.2014.0168]

3. Li, Q.; Chen, L.; Li, M.; Shaw, S.L.; Nüchter, A. A sensor-fusion drivable-region and lane-detection system for autonomous vehicle navigation in challenging road scenarios. IEEE Trans. Veh. Technol.; 2013; 63, pp. 540-555. [DOI: https://dx.doi.org/10.1109/TVT.2013.2281199]

4. Chen, L.; Fan, L.; Xie, G.; Huang, K.; Nüchter, A. Moving-object detection from consecutive stereo pairs using slanted plane smoothing. IEEE Trans. Intell. Transp. Syst.; 2017; 18, pp. 3093-3102. [DOI: https://dx.doi.org/10.1109/TITS.2017.2680538]

5. Fu, Y.; Li, C.; Yu, F.R.; Luan, T.H.; Zhang, Y. A decision-making strategy for vehicle autonomous braking in emergency via deep reinforcement learning. IEEE Trans. Veh. Technol.; 2020; 69, pp. 5876-5888. [DOI: https://dx.doi.org/10.1109/TVT.2020.2986005]

6. Chen, L.; Shan, Y.; Tian, W.; Li, B.; Cao, D. A fast and efficient double-tree RRT*-like sampling-based planner applying on mobile robotic systems. IEEE/ASME Trans. Mechatron.; 2018; 23, pp. 2568-2578. [DOI: https://dx.doi.org/10.1109/TMECH.2018.2821767]

7. Yao, Q.; Tian, Y.; Wang, Q.; Wang, S. Control strategies on path tracking for autonomous vehicle: State of the art and future challenges. IEEE Access; 2020; 8, pp. 161211-161222. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3020075]

8. Bertolini, G.; Straumann, D. Moving in a moving world: A review on vestibular motion sickness. Front. Neurol.; 2016; 7, 14. [DOI: https://dx.doi.org/10.3389/fneur.2016.00014]

9. Reason, J.T. Motion sickness adaptation: A neural mismatch model. J. R. Soc. Med.; 1978; 71, pp. 819-829. [DOI: https://dx.doi.org/10.1177/014107687807101109]

10. Medina Santiago, A.; Orozco Torres, J.A.; Hernández Gracidas, C.A.; Garduza, S.H.; Franco, J.D. Diagnosis and Study of Mechanical Vibrations in Cargo Vehicles Using ISO 2631-1: 1997. Sensors; 2023; 23, 9677. [DOI: https://dx.doi.org/10.3390/s23249677]

11. Wada, T. Motion sickness in automated vehicles. Advanced Vehicle Control; CRC Press: Boca Raton, FL, USA, 2016; pp. 169-174.

12. Diels, C.; Bos, J.E. Self-driving carsickness. Appl. Ergon.; 2016; 53, pp. 374-382. [DOI: https://dx.doi.org/10.1016/j.apergo.2015.09.009] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26446454]

13. Paden, B.; Čáp, M.; Yong, S.Z.; Yershov, D.; Frazzoli, E. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell. Veh.; 2016; 1, pp. 33-55. [DOI: https://dx.doi.org/10.1109/TIV.2016.2578706]

14. Amer, N.H.; Zamzuri, H.; Hudha, K.; Aparow, V.R.; Kadir, Z.A.; Abidin, A.F.Z. Modelling and trajectory following of an armoured vehicle. Proceedings of the 2016 SICE International Symposium on Control Systems (ISCS); Nagoya, Japan, 7–10 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1-6.

15. Zhao, P.; Chen, J.; Song, Y.; Tao, X.; Xu, T.; Mei, T. Design of a control system for an autonomous vehicle based on adaptive-pid. Int. J. Adv. Robot. Syst.; 2012; 9, 44. [DOI: https://dx.doi.org/10.5772/51314]

16. Park, M.W.; Lee, S.W.; Han, W.Y. Development of lateral control system for autonomous vehicle based on adaptive pure pursuit algorithm. Proceedings of the 2014 14th International Conference on Control, Automation and Systems (ICCAS 2014), Gyeonggi-do, Republic of Korea; 22–25 October 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1443-1447.

17. Hoffmann, G.M.; Tomlin, C.J.; Montemerlo, M.; Thrun, S. Autonomous automobile trajectory tracking for off-road driving: Controller design, experimental validation and racing. Proceedings of the 2007 American Control Conference; New York, NY, USA, 9–13 July 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 2296-2301.

18. Sharp, R. Rider control of a motorcycle near to its cornering limits. Veh. Syst. Dyn.; 2012; 50, pp. 1193-1208. [DOI: https://dx.doi.org/10.1080/00423114.2011.607899]

19. Yamashita, A.S.; Alexandre, P.M.; Zanin, A.C.; Odloak, D. Reference trajectory tuning of model predictive control. Control Eng. Pract.; 2016; 50, pp. 1-11. [DOI: https://dx.doi.org/10.1016/j.conengprac.2016.02.003]

20. Falcone, P.; Borrelli, F.; Asgari, J.; Tseng, H.E.; Hrovat, D. Predictive active steering control for autonomous vehicle systems. IEEE Trans. Control Syst. Technol.; 2007; 15, pp. 566-580. [DOI: https://dx.doi.org/10.1109/TCST.2007.894653]

21. Gutjahr, B.; Gröll, L.; Werling, M. Lateral vehicle trajectory optimization using constrained linear time-varying MPC. IEEE Trans. Intell. Transp. Syst.; 2016; 18, pp. 1586-1595. [DOI: https://dx.doi.org/10.1109/TITS.2016.2614705]

22. Siddiqi, M.R.; Milani, S.; Jazar, R.N.; Marzbani, H. Motion sickness mitigating algorithms and control strategy for autonomous vehicles. IEEE Trans. Intell. Transp. Syst.; 2022; 24, pp. 304-315. [DOI: https://dx.doi.org/10.1109/TITS.2022.3215175]

23. Amer, N.H.; Zamzuri, H.; Hudha, K.; Kadir, Z.A. Modelling and control strategies in path tracking control for autonomous ground vehicles: A review of state of the art and challenges. J. Intell. Robot. Syst.; 2017; 86, pp. 225-254. [DOI: https://dx.doi.org/10.1007/s10846-016-0442-0]

24. Shan, Y.; Yang, W.; Chen, C.; Zhou, J.; Zheng, L.; Li, B. CF-pursuit: A pursuit method with a clothoid fitting and a fuzzy controller for autonomous vehicles. Int. J. Adv. Robot. Syst.; 2015; 12, pp. 1-13. [DOI: https://dx.doi.org/10.5772/61391]

25. Zhu, Q.; Huang, Z.; Liu, D.; Dai, B. An adaptive path tracking method for autonomous land vehicle based on neural dynamic programming. Proceedings of the 2016 IEEE International Conference on Mechatronics and Automation; Harbin, China, 7–10 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1429-1434.

26. Farag, W. Complex trajectory tracking using PID control for autonomous driving. Int. J. Intell. Transp. Syst. Res.; 2020; 18, pp. 356-366. [DOI: https://dx.doi.org/10.1007/s13177-019-00204-2]

27. Park, M.; Lee, S.; Han, W. Development of steering control system for autonomous vehicle using geometry-based path tracking algorithm. Etri J.; 2015; 37, pp. 617-625. [DOI: https://dx.doi.org/10.4218/etrij.15.0114.0123]

28. Lee, D.; Lee, S.J.; Yim, S.C. Reinforcement learning-based adaptive PID controller for DPS. Ocean Eng.; 2020; 216, 108053. [DOI: https://dx.doi.org/10.1016/j.oceaneng.2020.108053]

29. Ghafarian, M.; Watson, M.; Mohajer, N.; Nahavandi, D.; Kebria, P.M.; Mohamed, S. A review of dynamic vehicular motion simulators: Systems and algorithms. IEEE Access; 2023; 11, pp. 36331-36348. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3265999]

30. Zha, Y.; Deng, J.; Qiu, Y.; Zhang, K.; Wang, Y. A survey of intelligent driving vehicle trajectory tracking based on vehicle dynamics. SAE Int. J. Veh. Dyn. Stab. NVH; 2023; 7, pp. 221-248. [DOI: https://dx.doi.org/10.4271/10-07-02-0014]

31. Chen, S.; Chen, H.; Negrut, D. Implementation of MPC-based path tracking for autonomous vehicles considering three vehicle dynamics models with different fidelities. Automot. Innov.; 2020; 3, pp. 386-399. [DOI: https://dx.doi.org/10.1007/s42154-020-00118-w]

32. Mattingley, J.; Boyd, S. CVXGEN: A code generator for embedded convex optimization. Optim. Eng.; 2012; 13, pp. 1-27. [DOI: https://dx.doi.org/10.1007/s11081-011-9176-9]

33. Merabti, H.; Belarbi, K.; Bouchemal, B. Nonlinear predictive control of a mobile robot: A solution using metaheuristcs. J. Chin. Inst. Eng.; 2016; 39, pp. 282-290. [DOI: https://dx.doi.org/10.1080/02533839.2015.1091276]

34. Shladover, S.E.; Desoer, C.A.; Hedrick, J.K.; Tomizuka, M.; Walrand, J.; Zhang, W.B.; McMahon, D.H.; Peng, H.; Sheikholeslam, S.; McKeown, N. Automated vehicle control developments in the PATH program. IEEE Trans. Veh. Technol.; 1991; 40, pp. 114-130. [DOI: https://dx.doi.org/10.1109/25.69979]

35. Kapania, N.R.; Gerdes, J.C. Design of a feedback-feedforward steering controller for accurate path tracking and stability at the limits of handling. Veh. Syst. Dyn.; 2015; 53, pp. 1687-1704. [DOI: https://dx.doi.org/10.1080/00423114.2015.1055279]

36. Ly, A.O.; Akhloufi, M. Learning to drive by imitation: An overview of deep behavior cloning methods. IEEE Trans. Intell. Veh.; 2020; 6, pp. 195-209. [DOI: https://dx.doi.org/10.1109/TIV.2020.3002505]

37. Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell.; 2022; 45, pp. 12878-12895. [DOI: https://dx.doi.org/10.1109/TPAMI.2022.3200245] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35984797]

38. Ye, Y.; Qiu, D.; Wang, H.; Tang, Y.; Strbac, G. Real-time autonomous residential demand response management based on twin delayed deep deterministic policy gradient learning. Energies; 2021; 14, 531. [DOI: https://dx.doi.org/10.3390/en14030531]

39. Puterman, M.L. Markov decision processes. Handbooks Oper. Res. Manag. Sci.; 1990; 2, pp. 331-434.

40. Golding, J.; Markey, H.; Stott, J. The effects of motion direction, body axis, and posture on motion sickness induced by low frequency linear oscillation. Aviat. Space Environ. Med.; 1995; 66, pp. 1046-1051.

41. Donohew, B.E.; Griffin, M.J. Motion sickness: Effect of the frequency of lateral oscillation. Aviat. Space, Environ. Med.; 2004; 75, pp. 649-656.

Word count: 8068

Show less

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Path tracking is an essential task for autonomous driving (AD), for which controllers are designed to issue commands so that vehicles will follow the path of upper-level decision planning properly to ensure operational safety, comfort, and efficiency. Current path-tracking methods still face challenges in balancing tracking accuracy with computational overhead, and more critically, lack consideration for Motion Sickness (MS) mitigation. However, as AD applications divert occupants’ attention to non-driving activities at varying degrees, MS in self-driving vehicles has been significantly exacerbated. This study presents a novel framework, the Hybrid Supervised–Reinforcement Learning (HSRL), designed to reduce passenger discomfort while achieving high-precision tracking performance with computational efficiency. The proposed HSRL employs expert data-guided supervised learning to rapidly optimize the path-tracking model, effectively mitigating the sample efficiency bottleneck inherent in pure Reinforcement Learning (RL). Simultaneously, the RL architecture integrates a passenger MS mechanism into a multi-objective reward function. This design enhances model robustness and control performance, achieving both high-precision tracking and passenger comfort optimization. Simulation experiments demonstrate that the HSRL significantly outperforms Proportional–Integral–Derivative (PID) and Model Predictive Control (MPC), achieving improved tracking accuracy and significantly reducing passengers’ cumulative Motion Sickness Dose Value (MSDV) across several test scenarios.

Details

Title

Hybrid Supervised and Reinforcement Learning for Motion-Sickness-Aware Path Tracking in Autonomous Vehicles

Author

Lv Yukang¹; Chen, Yi¹; Chen, Ziguo¹

; Fan Yuze¹

; Tao Yongchao¹; Zhao, Rui¹

; Gao Fei²

¹ College of Automotive Engineering, Jilin University, Changchun 130025, China; [email protected] (Y.L.); [email protected] (Y.C.); [email protected] (Z.C.); [email protected] (Y.F.); [email protected] (Y.T.); [email protected] (R.Z.)
² College of Automotive Engineering, Jilin University, Changchun 130025, China; [email protected] (Y.L.); [email protected] (Y.C.); [email protected] (Z.C.); [email protected] (Y.F.); [email protected] (Y.T.); [email protected] (R.Z.), National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University, Changchun 130025, China

First page

3695

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

14248220

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/s25123695

ProQuest document ID

3223942093

Hybrid Supervised and Reinforcement Learning for Motion-Sickness-Aware Path Tracking in Autonomous Vehicles

Jump to:

Full text

1. Introduction

2. Related Works

2.1. Geometric Controllers

2.2. Model-Free Controllers

2.3. Model-Based Controllers

3. Methodology

3.1. Problem Definition

3.1.1. Input Representation

3.1.2. Output Representation

3.2. Method Framework

3.3. Offline Supervised Learning

3.3.1. Expert Demonstration

3.3.2. Loss Function

3.4. Online RL Framework

3.4.1. Markov Decision Process

3.4.2. Reward Function

3.4.3. RL Optimization Process

3.5. Path-Tracking Algorithm Based on HSRL

4. Experiment

4.1. Experimental Setup

4.1.1. Implementation Details

4.1.2. Training Dynamics Analysis

4.1.3. Baselines and Evaluation Metrics

4.2. Tracking Performance

4.3. Reduction in MS Performance

4.4. High-Speed Performance

4.5. Reward Hyperparameter Sensitivity Analysis

4.6. Ablation Study

5. Conclusions

Abstract

Details

Suggested sources