Abstract

We investigate the convergence properties of policy iteration and value iteration algorithms in reinforcement learning by leveraging fixed-point theory, with a focus on mappings that exhibit weak contractive behavior. Unlike traditional studies that rely on strong contraction properties, such as those defined by the Banach contraction principle, we consider a more general class of mappings that includes weak contractions. Employing Zamfirescu’s fixed-point theorem, we establish sufficient conditions for norm convergence in infinite-dimensional policy spaces under broad assumptions. Our approach extends the applicability of these algorithms to feedback control problems in reinforcement learning, where standard contraction conditions may not hold. Through illustrative examples, we demonstrate that this framework encompasses a wider range of operators, offering new insights into the robustness and flexibility of iterative methods in dynamic programming.

1. Introduction

In [1], Richard Bellman presented the concept of dynamic programming, demonstrating its utility in addressing complex decision-making problems by breaking them into simpler, coupled sub-problems that are solved iteratively over time. Dynamic programming techniques have played a pivotal role in optimization-based feedback control and have since been expanded to encompass a wide range of control problems, including impulsive control, as noted in [2,3,4] and related works.

A notable set of such iterative techniques are referred to as reinforcement learning (RL) algorithms. In their work, Bertsekas and Tsitsiklis [5] presented a comprehensive collection of RL algorithms, organized within the frameworks of value iteration (VI) and policy iteration (PI) methods.

In [6], Bertsekas and Ioffe explored the temporal differences policy iteration (TD(λ)) approach within the neuro-dynamic programming context, showing that the TD(λ) method can be incorporated into a PI framework known as λ-PIR. Subsequently, in [7], Bertsekas investigated the connection between TD(λ) and proximal algorithms, which are especially useful for tackling convex optimization challenges.

In [8], Li et al. expanded the λ-PIR framework from finite policy scenarios to contractive models with infinite policy spaces, utilizing abstract dynamic programming techniques. They defined precise compact operators essential for the operation of the algorithm and determined the conditions under which the λ-PIR method achieves almost sure convergence for problems involving infinite-dimensional policy spaces.

In [9], Belhenniche et al. explored the characteristics of reinforcement learning techniques designed for feedback control within the context of fixed-point theory. By making relatively broad assumptions, they derived sufficient conditions that ensure almost sure convergence in scenarios involving infinite-dimensional policy spaces.

Fixed-point theory provides a robust framework with wide-ranging applications across fields such as topology, nonlinear analysis, optimal control, and machine learning. A fundamental result in this domain is the Banach contraction principle, established by Banach [10], which states that, if (X, d) is a complete metric space and T: X → X is a mapping that satisfies

(1) d(Tx, Ty) ≤ γ d(x, y),

for some 0 ≤ γ < 1 and all x, y ∈ X, then T has a unique fixed point x*. Furthermore, for any starting point x0 ∈ X, the sequence {x_n} generated by the iteration x_{n+1} = T x_n converges to x*. Numerous variations and generalizations of the Banach contraction principle have been thoroughly explored in various mathematical settings. Notably, Kannan [11] proposed a completely different type of contraction that also ensures the existence of a unique fixed point for the associated operator.
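To fix ideas, the following is a minimal Python sketch (ours, not from the original paper) of the Picard iteration x_{n+1} = T x_n behind (1); the map T, the tolerance, and the iteration cap are illustrative choices.

```python
# Minimal sketch: Picard iteration for a toy Banach contraction on R.
# The map T(x) = 0.5*x + 1 is illustrative; it satisfies |T(x) - T(y)| <= 0.5*|x - y|,
# so the iterates converge to its unique fixed point x* = 2.

def picard_iteration(T, x0, tol=1e-10, max_iter=1000):
    """Iterate x_{n+1} = T(x_n) until successive iterates are within tol."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

T = lambda x: 0.5 * x + 1.0          # contraction with modulus gamma = 0.5
print(picard_iteration(T, x0=0.0))   # approaches the fixed point x* = 2.0
```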

Kannan contractive mappings are less restrictive than Banach contractions as the contraction condition depends on the distances to the images (not just between points). This can be useful when dealing with operators in Markov Decision Processes (MDPs), where strict Banach contraction properties may not hold, particularly in the context of λ-policy iteration.

In [12], Vetro introduced an extension of Hardy–Rogers-type contractions, proving fixed-point theorems for self-mappings defined on metric spaces and ordered metric spaces. These results were also applied to multistage decision processes, demonstrating their applicability in optimization and decision-making frameworks. The framework established in [12] highlights the importance of generalized contractions in iterative decision-making processes, particularly in scenarios where classical contraction conditions do not apply directly.

Building on the foundations of fixed-point theory, this work extends and generalizes the results in [8]. Specifically, we demonstrate that the properties of the iterative process proposed in [8] remain valid for weakly contractive mappings, which represent a significantly broader class of systems than those previously considered, including the models studied in previous research.

A mapping T: X → X is defined as a Kannan mapping if there exists 0 ≤ γ < 1/2 such that

(2) d(Tx, Ty) ≤ γ (d(x, Tx) + d(y, Ty)),

for all x, y ∈ X.

Later, Chatterjea [13] proposed a related contractive condition, expressed as follows:

(3) d(Tx, Ty) ≤ γ (d(x, Ty) + d(y, Tx)),

where 0 ≤ γ < 1/2 and x, y ∈ X.

Expanding on these contraction principles, Zamfirescu [14] introduced a broader class of contractive mappings that unify and generalize the conditions given by Banach, Kannan, and Chatterjea. This is expressed as follows: let X be a complete metric space, α, β, γ be real numbers with α < 1, β < 1/2, γ < 1/2, and T: X → X be a mapping such that, for each pair of distinct points x, y ∈ X, at least one of the following conditions is satisfied:

(1). d(Tx, Ty) ≤ α d(x, y),

(2). d(Tx, Ty) ≤ β (d(x, Tx) + d(y, Ty)),

(3). d(Tx, Ty) ≤ γ (d(x, Ty) + d(y, Tx)).

Then, T has a unique fixed point. These generalized contraction principles form the basis for the results presented in this study.

Building on these theoretical foundations, our work advances reinforcement learning algorithms by developing convergence guarantees for weakly contractive mappings via Zamfirescu’s fixed-point theorem, significantly broadening the applicability beyond classical Banach contraction settings. We further demonstrate norm convergence in infinite-dimensional policy spaces under substantially relaxed assumptions, overcoming longstanding limitations in analyzing practical reinforcement learning systems. This framework successfully handles several critical challenges, including non-contractive operators, continuous-state and -action spaces inducing infinite policy dimensions, and discontinuous operators frequently encountered in real-world application cases where conventional fixed-point methods typically fail.

Dynamic programming operators, like the Bellman operator in Markov Decision Processes (MDPs), often do not meet the Banach contraction conditions due to the inherent structure of the problem, such as the presence of high discount factors or incomplete policy evaluations. In these situations, Zamfirescu’s fixed-point theorem offers a more flexible and powerful framework, enabling convergence to be established even when the Banach contraction condition fails, provided at least one of the three alternative conditions holds. This approach is particularly relevant for handling complex problems like those encountered in λ-policy iteration and other reinforcement learning contexts, where only weak contraction properties may hold.

Zamfirescu’s fixed-point theorem is particularly suitable for reinforcement learning because it deals with weakly contractive mappings, which are more general than standard Banach contractions. In reinforcement learning, especially in scenarios with high discount factors or non-standard reward structures, the operators involved may not satisfy the strong contraction conditions of Banach’s theorem. However, they might still satisfy the weaker conditions provided by Zamfirescu’s theorem, ensuring convergence of iterative algorithms.

In this paper, we employ fixed-point theory techniques to explore policy iteration and value iteration algorithms for mappings classified as weak contractions. By leveraging Zamfirescu’s theorem, we ensure the existence and uniqueness of solutions as well as the convergence of the involved operators. This approach is particularly significant for its applicability to discontinuous operators, offering a broader framework than traditional Banach or Kannan contractions.

Building on this foundation, we extend the convergence analysis of reinforcement learning algorithms—specifically value iteration (VI) and policy iteration (PI)—to a wider class of operators, satisfying Zamfirescu’s weak contraction conditions. Prior works, such as [8,9], rely on stricter contraction assumptions or compact operator properties, limiting their applicability. In contrast, our use of Zamfirescu’s theorem [14] accommodates discontinuous operators and non-uniform contraction constants, as demonstrated in our robotic navigation example (Section 5). This flexibility allows convergence guarantees in dynamic programming problems with high discount factors or irregular reward structures, common in feedback control. Moreover, we provide explicit sufficient conditions for norm convergence in infinite-dimensional policy spaces, enhancing the practicality of VI and PI in complex reinforcement learning tasks, such as real-time adaptive control.

The recent advances in policy iteration methods have evolved along two primary paths. The first involves viscosity approaches for regularized control problems [15], while the second focuses on constrained optimization frameworks [16]. Unlike these works, which assume Lipschitz continuity or explicit control constraints, our approach ensures convergence under milder conditions using fixed-point methods. This makes it especially well suited for reinforcement learning applications with sparse rewards or discontinuous dynamics.

The structure of this paper is as follows. In the upcoming section, we define the iterative feedback control problem under investigation, outlining the core definitions and the necessary assumptions that the data must meet. In Section 3, we introduce several fixed-point results that form the foundation of our contributions, which are also presented in that section. The value iteration and policy iteration framework is developed in Section 4 for Banach, Kannan, and Chatterjea operators, from the perspective of Zamfirescu’s fixed-point theorem. In Section 5, we provide an application of our findings to a real-world scenario, demonstrating their practical relevance. Lastly, we briefly discuss the conclusions and potential directions for future research in Section 6.

2. Preliminaries

We examine a collection of states denoted by X, a collection of controls denoted by U, and, for each state x ∈ X, a non-empty subset of controls U(x) ⊆ U. The set M is defined as the collection of all functions μ: X → U satisfying μ(x) ∈ U(x) for every x ∈ X, each of which we will refer to as a policy in the subsequent analysis.

Let υ(X) denote the space of real-valued functions V: X → R, where all values are finite. Its extension, ῡ(X), represents the space of extended real-valued functions V: X → R̄, where R̄ = R ∪ {−∞, +∞} incorporates infinite values. This distinction separates conventional function spaces (υ(X)) from their generalized counterparts (ῡ(X)), which are essential for handling unbounded behavior in optimization, measure theory, and related fields.

We analyze an operator defined as follows:

H: X × U × υ(X) → R,

and, for each policy μ ∈ M, we introduce a mapping Fμ: υ(X) → υ(X) given by

FμV(x) := H(x, μ(x), V), for all x ∈ X.

Next, we define a mapping F: υ(X) → ῡ(X) as

FV(x) := inf_{μ∈M} {FμV(x)}, for all x ∈ X.

Based on the definitions above, we can derive

FV(x) = inf_{μ∈M} {H(x, μ(x), V)} = inf_{u∈U(x)} {H(x, u, V)}.

For a given positive function ν: X → R, we define B(X) as the set of functions V for which ‖V‖ < ∞, with the norm ‖·‖ on B(X) specified as

‖V‖ = sup_{x∈X} |V(x)| / ν(x).

We assume that ν(x) is positive, bounded, and satisfies inf_{x∈X} ν(x) > 0 to guarantee that ‖V‖ remains finite, even when X is unbounded. The weighting function ν(x) normalizes |V(x)| across the state space via |V(x)|/ν(x), making B(X) a proper normed space even for unbounded X or growing V(x).
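As an illustration of these definitions, here is a small Python sketch (ours, not from the paper) of the operators Fμ and F and the weighted supremum norm on a toy finite problem; the state set, control sets, cost H, transition f, and weight ν are all hypothetical choices.

```python
import numpy as np

# Toy finite instance of the abstract framework: X and U(x) are small finite sets,
# and H(x, u, V) = c(x, u) + gamma * V(f(x, u)) is a hypothetical cost-to-go model.
X = [0, 1, 2]                        # states
U = {0: [0, 1], 1: [0, 1], 2: [0]}   # admissible controls U(x)
gamma = 0.9

def f(x, u):                         # deterministic transition (illustrative)
    return min(x + u, 2)

def c(x, u):                         # stage cost (illustrative)
    return (x - 2) ** 2 + u ** 2

def H(x, u, V):
    return c(x, u) + gamma * V[f(x, u)]

def F_mu(V, mu):                     # policy evaluation operator F_mu V
    return {x: H(x, mu[x], V) for x in X}

def F(V):                            # Bellman optimality operator F V = inf over U(x)
    return {x: min(H(x, u, V) for u in U[x]) for x in X}

nu = {0: 1.0, 1: 1.0, 2: 1.0}        # weighting function (here constant)

def weighted_norm(V):                # ||V|| = sup_x |V(x)| / nu(x)
    return max(abs(V[x]) / nu[x] for x in X)

V0 = {x: 0.0 for x in X}
print(F(V0), weighted_norm(F(V0)))
```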

The following lemma naturally follows from these definitions:

Lemma 1

([8]). The space B(X) is complete under the topology defined by the norm ‖·‖.

Additionally, it is evident that B(X) is both closed and convex. Therefore, for a sequence {V_k}_{k=1}^∞ ⊂ B(X) and a function V ∈ B(X), if V_k converges to V in the sense that lim_{k→∞} ‖V_k − V‖ = 0, it follows that lim_{k→∞} V_k(x) = V(x) for every x ∈ X.

We now present the following standard assumptions.

Assumption 1

([8] (Well-posedness)). For every V ∈ B(X) and every policy μ ∈ M, it holds that FμV ∈ B(X) and FV ∈ B(X). This assumption ensures that the value functions remain bounded under the given norm, which is typical in discounted MDPs commonly used in reinforcement learning to model decision-making processes.

Definition 1

([11] (Kannan’s mapping)). A self-map Fμ of B(X) is called a Kannan mapping if there exists a constant γ ∈ [0, 1/2) such that

(4) ‖FμV − FμV′‖ ≤ γ (‖V − FμV‖ + ‖V′ − FμV′‖), ∀ V, V′ ∈ B(X).

One immediate consequence of Definition 1 is that, if each policy evaluation operator Fμ is a Kannan contraction with modulus γ, then the Bellman optimality operator F, defined as

FV(x) := inf_{μ∈M} (FμV)(x),

is also a Kannan contraction with the same modulus γ. To show this, observe that

(FμV)(x)/ν(x) ≤ (FμV′)(x)/ν(x) + γ (‖V − FμV‖ + ‖V′ − FμV′‖), ∀ x ∈ X, μ ∈ M.

Taking the infimum over μ ∈ M yields

FV(x)/ν(x) ≤ FV′(x)/ν(x) + γ (‖V − FμV‖ + ‖V′ − FμV′‖).

Since FV = inf_μ FμV, it follows that ‖V − FμV‖ ≤ ‖V − FV‖; hence,

FV(x)/ν(x) ≤ FV′(x)/ν(x) + γ (‖V − FV‖ + ‖V′ − FV′‖), ∀ x ∈ X.

By symmetry, the reverse inequality also holds as follows:

FV′(x)/ν(x) ≤ FV(x)/ν(x) + γ (‖V − FV‖ + ‖V′ − FV′‖).

Combining both yields the Kannan contraction inequality for F:

‖FV − FV′‖ ≤ γ (‖V − FV‖ + ‖V′ − FV′‖),

thereby confirming that F is a Kannan contraction. This is relevant to RL as it allows convergence analysis in policy evaluation steps, where operators may not be strongly contractive but still exhibit weaker contractive properties suitable for infinite state–action spaces.

This kind of mapping is very important in metric fixed-point theory. It is well known that Banach contraction mappings are continuous, while Kannan mappings are not necessarily continuous. Kannan’s theorem is important because Subrahmanyam [17] proved that it characterizes metric completeness: a metric space X is complete if and only if every mapping on X satisfying (2) with 0 ≤ γ < 1/2 has a fixed point. The Banach contraction principle does not have this property.

Example 1

([11]). Let E = {V ∈ B(X): 0 ≤ V(x) ≤ 1} and define FV(x) = V(x)/4 if V(x) ∈ [0, 1/2) and FV(x) = V(x)/5 if V(x) ∈ [1/2, 1]. Here, F is discontinuous at V(x) = 1/2; consequently, the Banach contraction condition is not satisfied. However, it is easily seen that the Kannan condition holds with α = 4/9.

Example 2.

Let X = R with the usual metric and T: X → X be defined by

T(x) = 0 if x ≤ 2, and T(x) = −1/2 if x > 2.

Then, T is not continuous on X, and T satisfies the contractive condition (2) with γ = 1/5.
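The Kannan inequality in Example 2 can be spot-checked numerically. The sketch below (ours) samples random pairs and verifies d(Tx, Ty) ≤ γ (d(x, Tx) + d(y, Ty)) with γ = 1/5, and also shows that no single Banach modulus below 1 works across the discontinuity; the sampling range and tolerance are arbitrary choices, and this is a sanity check, not a proof.

```python
import random

# Example 2: T(x) = 0 for x <= 2, T(x) = -1/2 for x > 2 (discontinuous at x = 2).
def T(x):
    return 0.0 if x <= 2 else -0.5

gamma = 0.2  # claimed Kannan constant 1/5

# Spot-check the Kannan condition d(Tx,Ty) <= gamma*(d(x,Tx) + d(y,Ty)) on random pairs.
for _ in range(100_000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    lhs = abs(T(x) - T(y))
    rhs = gamma * (abs(x - T(x)) + abs(y - T(y)))
    assert lhs <= rhs + 1e-12, (x, y, lhs, rhs)

# The Banach condition fails: the ratio d(Tx,Ty)/d(x,y) blows up across x = 2.
for eps in (1e-1, 1e-3, 1e-6):
    x, y = 2.0, 2.0 + eps
    print(eps, abs(T(x) - T(y)) / abs(x - y))   # equals 0.5/eps, unbounded as eps -> 0
```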

Definition 2

([13] (Chatterjea’s mapping)). A self-map Fμ of B(X) is called a Chatterjea mapping if there exists γ ∈ [0, 1/2) such that

(5) ‖FμV − FμV′‖ ≤ γ (‖V − FμV′‖ + ‖V′ − FμV‖).

This inequality holds for all V, V′ ∈ B(X).

Notation Summary

The key mathematical notations used throughout this paper are summarized in Table 1.

3. Auxiliary Results

In this section, we present a number of results that will be instrumental in the proof of the main result of this article.

Theorem 1

([11,14] (Existence)). Suppose that the operators F, Fμ: B(X) → B(X) are Kannan mappings. Then, there exist fixed points V* for F and Vμ for Fμ, respectively.

Based on the result of Theorem 1, we can establish the following:

Lemma 2

([11]). The following properties are satisfied:

(i) Starting with any V0 ∈ B(X), the sequence {V_k} generated by the iteration V_{k+1} = FμV_k converges to Vμ in norm.

(ii) Starting with any V0 ∈ B(X), the sequence {V_k} generated by the iteration V_{k+1} = FV_k converges to V* in norm.

The results of Lemma 2 highlight two fundamental iterative processes in dynamic programming: value iteration (VI) and policy iteration (PI). Value iteration is an approach in which the value function is updated iteratively using the Bellman optimality operator F until it converges to the optimal value function V*. This approach does not directly maintain a policy at each iteration but instead progressively improves the value function, which can then be used to derive the optimal policy. In contrast, policy iteration operates through two alternating phases: policy evaluation and policy enhancement. The evaluation phase, as indicated in the lemma by Vk+1=FμVk, focuses on calculating the value function Vμ for a given policy μ. Following this, the enhancement phase updates the policy to minimize the expected cost, and this cycle continues until the optimal policy is achieved. Unlike value iteration, which directly refines the value function, policy iteration employs systematic updates via policy evaluation and enhancement, often resulting in quicker convergence in real-world scenarios.

As noted above, the findings of Lemma 2 provide a foundation for value iteration (VI). However, to guarantee the success of policy iteration (PI), further assumptions must be introduced.

Assumption 2

([8] (Monotonicity)). For any V, V′ ∈ B(X), if V ≤ V′, then it follows that

H(x, u, V) ≤ H(x, u, V′), ∀ x ∈ X, u ∈ U(x),

where the inequality ≤ is understood in a pointwise manner throughout X.

Assumption 3

([8] (Attainability)). For every V ∈ B(X), there exists a policy μ ∈ M such that FμV = FV.

Clarification of Contraction Assumptions

To ensure convergence of value iteration (VI) and policy iteration (PI), we precisely define the contraction properties assumed for the Bellman optimality operator F and the policy evaluation operator Fμ, consistently applied across Definition 1, Theorem 1, and all convergence results.

4. Main Results

In this section, we employ the abstract dynamic programming (DP) framework developed in [18], integrating both the policy iteration (PI) and value iteration (VI) methods to extend finite-policy problems to three contractive models with infinite policy spaces. These extensions are analyzed through the theoretical lens of Zamfirescu’s fixed-point theorem. Infinite policy spaces naturally emerge in two key scenarios: (1) systems with infinite state spaces or (2) finite state spaces with uncountable control action sets. To rigorously address these cases, we investigate the following generalization of contractive mappings:

Theorem 2 

([14]). Let B(X) be a Banach space, α, β, γ be real numbers with α < 1, β < 1/2, γ < 1/2, and F, Fμ: B(X) → B(X) be mappings such that, for each pair of different functions V, V′ ∈ B(X), at least one of the following conditions is satisfied: (C1)

‖FμV − FμV′‖ ≤ α ‖V − V′‖,

(C2)

‖FμV − FμV′‖ ≤ β (‖V − FμV‖ + ‖V′ − FμV′‖),

(C3)

‖FμV − FμV′‖ ≤ γ (‖V − FμV′‖ + ‖V′ − FμV‖).

Then, there exist fixed points V* for F and Vμ for Fμ, respectively.

Proof. 

We apply the same methodology as Zamfirescu [14], but within the framework of dynamic programming notation. Consider the number

δ = max{α, β/(1−β), γ/(1−γ)}.

Obviously, δ < 1. Now, let us choose an arbitrary V0 ∈ B(X) and fix an integer n ≥ 0. Take V = Fμ^n V0 and V′ = Fμ^{n+1} V0. Suppose V ≠ V′; otherwise, V is a fixed point of Fμ. If condition (C1) is satisfied for these two functions, then

‖Fμ^{n+1} V0 − Fμ^{n+2} V0‖ ≤ α ‖Fμ^n V0 − Fμ^{n+1} V0‖ ≤ δ ‖Fμ^n V0 − Fμ^{n+1} V0‖.

If condition (C2) is verified for V and V′, then

‖Fμ^{n+1} V0 − Fμ^{n+2} V0‖ ≤ β (‖Fμ^n V0 − Fμ^{n+1} V0‖ + ‖Fμ^{n+1} V0 − Fμ^{n+2} V0‖),

which implies

‖Fμ^{n+1} V0 − Fμ^{n+2} V0‖ ≤ (β/(1−β)) ‖Fμ^n V0 − Fμ^{n+1} V0‖ ≤ δ ‖Fμ^n V0 − Fμ^{n+1} V0‖.

In case condition (C3) is satisfied, then, using the triangle inequality on ‖Fμ^n V0 − Fμ^{n+2} V0‖,

‖Fμ^{n+1} V0 − Fμ^{n+2} V0‖ ≤ γ (‖Fμ^n V0 − Fμ^{n+1} V0‖ + ‖Fμ^{n+1} V0 − Fμ^{n+2} V0‖),

implying the contraction

‖Fμ^{n+1} V0 − Fμ^{n+2} V0‖ ≤ (γ/(1−γ)) ‖Fμ^n V0 − Fμ^{n+1} V0‖ ≤ δ ‖Fμ^n V0 − Fμ^{n+1} V0‖.

This inequality, valid for every n, implies that {Fμ^n V0}_{n=0}^∞ is a Cauchy sequence and therefore converges to some point V* ∈ B(X). Since Fμ is not necessarily continuous, let us assume that ‖V* − FμV*‖ = r > 0. Then, we can write the following estimate:

‖V* − FμV*‖ ≤ ‖V* − V_{n+1}‖ + ‖V_{n+1} − FμV*‖ = ‖V* − V_{n+1}‖ + ‖FμV_n − FμV*‖ ≤ ‖V* − V_{n+1}‖ + δ ‖V_n − V*‖,

where V_n = Fμ^n V0. As both ‖V* − V_{n+1}‖ and ‖V_n − V*‖ tend to 0 as n → +∞, it follows that r ≤ 0, a contradiction. Hence, V* = FμV*.

Now, we show that this fixed point V* is unique. Suppose this is not true; then, FμW = W for some point W ≠ V*. Then,

‖FμV* − FμW‖ = ‖V* − W‖,

‖FμV* − FμW‖ > ‖V* − FμV*‖ + ‖W − FμW‖,

‖FμV* − FμW‖ = (1/2) (‖V* − FμW‖ + ‖W − FμV*‖),

ensuring that none of the three conditions of Theorem 2 are met at the points V* and W. □

Zamfirescu’s theorem brings together three different types of contractions: Banach, Kannan, and Chatterjea contractions. The Banach-type contraction (α-condition) ensures that distances shrink by a fixed factor, guaranteeing a unique fixed point. The Kannan-type contraction (β-condition), instead of comparing two points directly, considers distances from each point to its image under the function, an essential property in dynamic programming (DP) models where strict contraction may not hold yet iterative updates still lead to convergence. The Chatterjea-type contraction (γ-condition) introduces a symmetric relationship between distances, providing additional flexibility in systems where direct pointwise contraction is too restrictive.

Broadly, Banach contractions impose a uniform reduction in distance, making them more restrictive and often inapplicable in RL settings with discontinuous operators or high discount factors. In contrast, Kannan contractions relax this requirement by focusing on distances to mapped points, offering a broader scope that aligns with the iterative nature of RL algorithms like value iteration, where convergence can occur even without continuity. Zamfirescu’s theorem unifies these conditions, providing a robust framework in which only one of them needs to hold, enhancing applicability over relying on the Banach or Kannan condition alone.

The advantage of this formulation is its adaptability: if one contraction condition fails, another may still hold, ensuring the fixed-point property. This makes the theorem particularly valuable in DP models, especially when dealing with infinite policy spaces that arise from either continuous-state spaces or uncountable action sets, where establishing a single global contraction is often impractical.

Corollary 1

(Banach [10]). Let B(X) be a complete metric space, 0 ≤ α < 1, and Fμ: B(X) → B(X) a mapping such that, for each pair of different points in B(X), condition (C1) of Theorem 2 is verified. Then, Fμ has a unique fixed point.

Corollary 2

(Kannan [11]). Let B(X) be a complete metric space, 0 ≤ β < 1/2, and Fμ: B(X) → B(X) a mapping such that, for each pair of different points in B(X), condition (C2) of Theorem 2 is verified. Then, Fμ has a unique fixed point.

Corollary 3

(Chatterjea [13]). Let B(X) be a complete metric space, 0 ≤ γ < 1/2, and Fμ: B(X) → B(X) a mapping such that, for each pair of different points in B(X), condition (C3) of Theorem 2 is verified. Then, Fμ has a unique fixed point.

The following proposition presents key implications of Zamfirescu’s Theorem 2.

Proposition 1.

Let the conditions of Theorem 2 hold; then, (a)

For any V ∈ B(X):

(6) ‖V* − V‖ ≤ (1/(1−δ)) ‖FV − V‖.

(b)

For any V ∈ B(X) and μ ∈ M:

(7) ‖Vμ − V‖ ≤ (1/(1−δ)) ‖FμV − V‖.

Proof. 

Let us consider V_n = FV_{n−1}. We have

‖FV_n − FV_{n−1}‖ ≤ δ ‖FV_{n−1} − FV_{n−2}‖.

Then,

‖FV_n − FV_{n−1}‖ ≤ δ^{n−1} ‖FV_1 − FV_0‖.

Now, we use the triangle inequality, where n represents the iteration index in the sequence {FV_n}_{n=1}^∞, which is generated by applying the operator F iteratively:

‖FV_n − V‖ ≤ Σ_{k=1}^{n} ‖FV_k − FV_{k−1}‖ ≤ Σ_{k=1}^{n} δ^{k−1} ‖FV − V‖.

Taking the limit as n → +∞ and using Lemma 2, we obtain

‖V* − V‖ ≤ (1/(1−δ)) ‖FV − V‖.

Brief Justification:

The bounds in Proposition 1 follow from three key observations:

1. Zamfirescu’s Contraction: The operator F satisfies

‖FV − FW‖ ≤ δ ‖V − W‖, where δ = max{α, β/(1−β), γ/(1−γ)} < 1,

derived from conditions (C1)–(C3) in Theorem 2.

2. Iterative Shrinkage: For the sequence V_n = F^n V, we have

‖V_{n+1} − V_n‖ ≤ δ ‖V_n − V_{n−1}‖ ≤ δ^n ‖FV − V‖,

showing that each iteration reduces the distance by a factor of δ.

3. Geometric Convergence: The fixed-point error bound emerges from summing the geometric series:

‖V* − V‖ ≤ Σ_{k=1}^{∞} ‖V_k − V_{k−1}‖ ≤ (1/(1−δ)) ‖FV − V‖.

The identical logic applies to Fμ due to the uniform contraction across M. The factor 1/(1−δ) quantifies how the contraction strength governs the approximation error.
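As a concrete check of this geometric-series bound, consider the scalar affine map that reappears in the Section 5 example at x = 0.5 on S1, F(v) = 0.5 + 0.125 v, which contracts with modulus δ = 0.125 and has fixed point v* = 4/7 ≈ 0.571. The sketch below (ours) compares the true error with the bound (1/(1−δ)) ‖Fv − v‖; the test points are arbitrary.

```python
# Check of the bound ||V* - V|| <= (1/(1-delta)) * ||F V - V|| for a scalar
# affine contraction (the S1 branch of the Section 5 operator at x = 0.5).
F = lambda v: 0.5 + 0.125 * v     # contraction with delta = 0.125
delta = 0.125
v_star = 0.5 / (1 - 0.125)        # fixed point 4/7 ~ 0.5714

for v in (0.0, 0.3, 1.0, 1.8):
    true_error = abs(v_star - v)
    bound = abs(F(v) - v) / (1 - delta)
    print(f"v={v:4.1f}  error={true_error:.4f}  bound={bound:.4f}")
    assert true_error <= bound + 1e-12
# For this one-dimensional affine map the bound holds with equality.
```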

Remark 1.

Based on Proposition 1, if we set V = V*, it can be shown that, for any ϵ > 0, there exists a policy μ_ϵ ∈ M such that ‖V_{μϵ} − V*‖ ≤ ϵ. This can be achieved by selecting μ_ϵ(x) to minimize H(x, u, V*) over U(x) with an error not exceeding (1−δ) ϵ ν(x) for all x ∈ X. Specifically, we obtain

‖V_{μϵ} − V*‖ ≤ (1/(1−δ)) ‖F_{μϵ} V* − V*‖ = (1/(1−δ)) ‖F_{μϵ} V* − F V*‖ ≤ ϵ.

Consequently, this implies

|F_{μϵ} V*(x) − F V*(x)| ≤ (1−δ) ϵ ν(x).

The significance of monotonicity and Zamfirescu’s theorem is illustrated by proving that V*, the fixed point of F, is the infimum over all μ ∈ M of Vμ, the unique fixed point of Fμ.

Proposition 2.

Let the monotonicity assumption and the conditions of Zamfirescu’s theorem hold. Then, for all x ∈ X, we have

V*(x) = inf_{μ∈M} Vμ(x).

Proof. 

We will establish the two inequalities that define the infimum.

First inequality: V*(x) ≤ inf_{μ∈M} Vμ(x).

For all μ ∈ M, we have FV* ≤ FμV*. Since FV* = V* by the definition of the fixed point, this simplifies to

V* ≤ FμV*.

Applying the operator Fμ repeatedly to both sides and using the monotonicity assumption, we obtain

V* ≤ Fμ^k V* for all k > 0.

Taking the limit as k → ∞ and noting that the sequence Fμ^k V* converges to Vμ, it follows that

V* ≤ Vμ for all μ ∈ M.

Thus, we have V*(x) ≤ inf_{μ∈M} Vμ(x).

Second inequality: V*(x) ≥ inf_{μ∈M} Vμ(x).

Using Remark 1, for any ϵ > 0, there exists μ_ϵ ∈ M such that

V_{μϵ} ≤ V* + ϵ.

Thus, for each ϵ > 0, we have

inf_{μ∈M} Vμ(x) ≤ V_{μϵ}(x) ≤ V*(x) + ϵ.

Taking the limit as ϵ → 0, we obtain

inf_{μ∈M} Vμ(x) ≤ V*(x).

Therefore, V*(x) ≥ inf_{μ∈M} Vμ(x).

Combining these two inequalities, we obtain

V*(x) = inf_{μ∈M} Vμ(x) for all x ∈ X.

These results show that the fixed point of F acts as a lower bound for policy-dependent fixed points Vμ, ensuring the optimal value function can be approximated within a controlled error. The upper bounds from Zamfirescu’s theorem highlight convergence and stability, key to feedback control in dynamic programming and optimization.

4.1. Algorithmic Framework

Based on Theorem 2, we describe the implementation of value iteration (VI) and policy iteration (PI) for weakly contractive operators:

Value Iteration Algorithm:

1. Initialize V0 ∈ B(X) (typically V0(x) = 0 for all x ∈ X).

2. Set a convergence threshold ϵ > 0 and a maximum number of iterations K_max.

3. For k = 0, 1, …, K_max − 1:

Compute V_{k+1}(x) = FV_k(x) = inf_{u∈U(x)} H(x, u, V_k) for all x ∈ X.

If ‖V_{k+1} − V_k‖ < ϵ(1−δ)/δ, return V_{k+1} as the approximate fixed point V*.

4. Return V_{K_max} (with a warning if unconverged).

Policy Iteration Algorithm:

1. Initialize a policy μ0 ∈ M arbitrarily.

2. Set a convergence threshold ϵ > 0 and a maximum number of iterations K_max.

3. For k = 0, 1, …, K_max − 1:

(a). Policy Evaluation:

Initialize V_0^{μk} ∈ B(X).

Repeat:

Compute V_{n+1}^{μk}(x) = F_{μk} V_n^{μk}(x) = H(x, μk(x), V_n^{μk}).

Until ‖V_n^{μk} − V_{n−1}^{μk}‖ < ϵ(1−δ)/δ.

(b). Policy Improvement:

Update μ_{k+1}(x) = argmin_{u∈U(x)} H(x, u, V_n^{μk}) for all x ∈ X.

(c). If μ_{k+1} = μk, return (V_n^{μk}, μk) as the optimal solution.

4. Return (V^{μ_{K_max}}, μ_{K_max}) (with a warning if unconverged).

Key features of these algorithms:

The stopping criterion ϵ(1−δ)/δ derives from Proposition 1.

Policy evaluation leverages weak contraction properties (C1)–(C3) from Theorem 2.

Convergence is guaranteed even when standard contraction factors approach 1.
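The two procedures above can be written compactly in Python. The sketch below is ours, not the authors’ code: it discretizes the state and control spaces of the navigation model from Section 5 (cost x² + u², dynamics x + u(1 − x), γ = 0.5) and runs value iteration and policy iteration with the stopping rule ϵ(1−δ)/δ; the grid sizes, the value δ = 0.5, and the nearest-grid interpolation are illustrative assumptions.

```python
import numpy as np

# Discretized instance of the Section 5 navigation model (illustrative grids).
X = np.linspace(0.0, 1.0, 101)          # state grid on [0, 1]
U = np.linspace(0.0, 1.0, 51)           # control grid on [0, 1]
gamma, delta, eps = 0.5, 0.5, 1e-6      # discount, assumed modulus, tolerance

def nearest(x):                          # index of the grid point closest to x
    return int(round(x * (len(X) - 1)))

def H(ix, u, V):                         # H(x, u, V) = c(x, u) + gamma * V(f(x, u))
    x = X[ix]
    return x**2 + u**2 + gamma * V[nearest(x + u * (1.0 - x))]

def value_iteration(max_iter=1000):
    V = np.zeros(len(X))
    for _ in range(max_iter):
        V_new = np.array([min(H(ix, u, V) for u in U) for ix in range(len(X))])
        if np.max(np.abs(V_new - V)) < eps * (1 - delta) / delta:
            return V_new
        V = V_new
    return V

def policy_iteration(max_iter=100):
    mu = np.zeros(len(X))                # initial policy: u = 0 everywhere
    for _ in range(max_iter):
        V = np.zeros(len(X))             # policy evaluation by fixed-point iteration
        while True:
            V_new = np.array([H(ix, mu[ix], V) for ix in range(len(X))])
            if np.max(np.abs(V_new - V)) < eps * (1 - delta) / delta:
                break
            V = V_new
        mu_new = np.array([min(U, key=lambda u: H(ix, u, V)) for ix in range(len(X))])
        if np.allclose(mu_new, mu):      # policy improvement has converged
            return V, mu
        mu = mu_new
    return V, mu

V_vi = value_iteration()
V_pi, mu_pi = policy_iteration()
print(V_vi[nearest(0.5)], V_pi[nearest(0.5)])   # value at x = 0.5 under both methods
```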

4.2. Examples

Consider the Banach space defined as

B(X) = {V ∈ B(X): 0 < V(x) < 2},

with the norm ‖V‖ = sup_{x∈X} |V(x)|. Define the function Fμ: B(X) → B(X) as

FμV = V/4 if V ∈ S1 = {V ∈ B(X): 0 ≤ V(x) ≤ 1}, and FμV = 1/2 − V/4 if V ∈ S2 = {V ∈ B(X): 1 < V(x) ≤ 2}.

For all V, V′ ∈ B(X), the function satisfies the contractive inequality

‖FμV − FμV′‖ ≤ δ max{‖V − V′‖, (‖V − FμV‖ + ‖V′ − FμV′‖)/2, (‖V − FμV′‖ + ‖V′ − FμV‖)/2},

with δ = 1/2. Therefore, at least one of the three conditions in Zamfirescu’s theorem holds, ensuring a unique fixed point.
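Since this operator acts pointwise, the inequality can be spot-checked on scalars. The sketch below (ours) samples pairs v, v′ in (0, 2) and verifies that the maximum of the three quantities, scaled by δ = 1/2, dominates |Fμv − Fμv′|; the sampling range and tolerance are arbitrary, and this is a numerical sanity check, not a proof.

```python
import random

# Scalar version of the piecewise operator from this example:
# F(v) = v/4 on S1 = [0, 1], F(v) = 1/2 - v/4 on S2 = (1, 2].
def F(v):
    return v / 4.0 if v <= 1.0 else 0.5 - v / 4.0

delta = 0.5
for _ in range(100_000):
    v, w = random.uniform(1e-6, 2.0), random.uniform(1e-6, 2.0)
    lhs = abs(F(v) - F(w))
    rhs = delta * max(
        abs(v - w),                                  # Banach-type term
        (abs(v - F(v)) + abs(w - F(w))) / 2.0,       # Kannan-type term
        (abs(v - F(w)) + abs(w - F(v))) / 2.0,       # Chatterjea-type term
    )
    assert lhs <= rhs + 1e-12, (v, w, lhs, rhs)
print("Zamfirescu-type inequality held on all sampled pairs.")
```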

A fixed point of Fμ satisfies V* = FμV*, leading to the following cases: if V* ∈ S1,

V* = V*/4 ⟹ V* − V*/4 = 0 ⟹ V*(1 − 1/4) = 0 ⟹ V* = 0.

This contradicts 0 < V* < 2 and implies V* ∉ S1. If V* ∈ S2,

V* = 1/2 − V*/4.

Solving for V*,

V* + V*/4 = 1/2 ⟹ (5/4)V* = 1/2 ⟹ V* = 2/5.

Since V* = 2/5 lies in (0, 2), it is the unique fixed point.

Given that V* represents the optimal cost-to-go function, the optimal feedback control follows by minimizing

u*(x) = argmin_u {C(x, u) + δ V*(f(x, u))}.

For a cost function C(x, u) = x² + u² and dynamics f(x, u) = ax + bu, the optimal control law is

u*(x) = Kx, K = b^{−1} λ a.

With V* = 2/5, explicit control computation is achievable, demonstrating the theorem’s application in control theory. Zamfirescu’s theorem ensures convergence, highlighting its significance in solving infinite-policy DP models.

5. Application

Consider a robotic navigation task in a bounded environment, where the goal is to optimize a warehouse robot’s path to a docking station while minimizing energy consumption. Let X = [0, 1] represent the normalized position space, U = [0, 1] denote the control input space (e.g., speed), and B(X) = {V ∈ B(X): 0 < V(x) < 2} be the Banach space of value functions with norm ‖V‖ = sup_{x∈X} |V(x)|. The operator H(x, u, V) models the cost-to-go as

(8) H(x, u, V) = c(x, u) + γ V(f(x, u)),

where c(x, u) = x² + u², γ = 0.5 is the discount factor, and f(x, u) = x + u(1 − x). Define the policy evaluation operator FμV(x) = x² + μ(x)² + 0.5 V(x + μ(x)(1 − x)) and the Bellman optimality operator FV(x) = inf_u {x² + u² + 0.5 V(x + u(1 − x))}.

To compare Zamfirescu’s and Banach’s approaches, consider the piecewise operator obtained for μ(x) = 0.5:

(9) FμV(x) = x² + 0.25 + 0.5·V(x)/4 if V(x) ∈ S1 = {V ∈ B(X): 0 ≤ V(x) ≤ 1}, and FμV(x) = x² + 0.25 + 0.5·(1/2 − V(x)/4) if V(x) ∈ S2 = {V ∈ B(X): 1 < V(x) ≤ 2}.

This operator satisfies Zamfirescu’s conditions:

(10) ‖FμV − FμV′‖ ≤ δ max{‖V − V′‖, (‖V − FμV‖ + ‖V′ − FμV′‖)/2, (‖V − FμV′‖ + ‖V′ − FμV‖)/2},

with δ = 0.5. For Banach’s approach, we test the single contraction condition

(11) ‖FμV − FμV′‖ ≤ 0.9 ‖V − V′‖.

The fixed point at x = 0.5 in S1 is

(12) V*(0.5) = 0.5 + 0.125 V*(0.5) ⟹ 0.875 V*(0.5) = 0.5 ⟹ V*(0.5) ≈ 0.571.

The optimal control is

(13) u*(x) = argmin_u {x² + u² + 0.5·0.571·(0.5x + 0.5u)},

yielding u*(x) = Kx, with K ≈ 0.667.

To validate Zamfirescu’s flexibility versus Banach’s limitations, we analyze convergence with V0(x) = 1. For FμV(x) = x² + 0.25 + 0.5·V(x)/4 in S1 at x = 0.5, Zamfirescu’s framework uses:

Banach-type: ‖FμV − FμV′‖ ≤ 0.125 ‖V − V′‖.

Kannan-type: ‖FμV − FμV′‖ ≤ 0.2 (‖V − FμV‖ + ‖V′ − FμV′‖).

Chatterjea-type: ‖FμV − FμV′‖ ≤ 0.2 (‖V − FμV′‖ + ‖V′ − FμV‖).

Zamfirescu’s approach switches conditions dynamically, ensuring convergence. For Banach, the operator’s discontinuity at V = 1 causes failure. Consider V(0.5) = 1 and V′(0.5) = 1 + ϵ:

(14) FμV(0.5) = 0.625, FμV′(0.5) = 0.625 − 0.125ϵ,

(15) |FμV(0.5) − FμV′(0.5)| / |V(0.5) − V′(0.5)| = 0.125ϵ/ϵ = 0.125.

However, boundary effects amplify the effective contraction factor. As shown in Figure 1, the operator Fμ exhibits piecewise behavior with a stable fixed point at V* ≈ 0.571 under Zamfirescu’s framework, while Banach’s method becomes unstable near V = 1. This instability is further quantified in Table 2, where Banach’s single-rate contraction (α = 0.9) yields an error of 0.031 after 10 iterations, significantly higher than Zamfirescu’s adaptive methods (errors ≤ 0.00016).

These results highlight the usefulness of Zamfirescu’s approach. The Banach approach fails due to the discontinuity at V = 1 (see Appendix A), which amplifies the effective contraction factor in discretized settings, while Zamfirescu’s condition-switching ensures robust convergence for robotic navigation.
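For completeness, here is a minimal sketch (ours) of the fixed-point iteration for the piecewise operator (9) evaluated at x = 0.5 and started from V0 = 1; it converges to V* ≈ 0.571 as stated above. It does not attempt to reproduce the per-condition errors of Table 2, which depend on the specific update schemes used for each contraction type.

```python
# Fixed-point iteration for the piecewise operator (9) evaluated at x = 0.5.
def F_mu(v, x=0.5):
    if v <= 1.0:                                   # S1 branch
        return x**2 + 0.25 + 0.5 * (v / 4.0)
    return x**2 + 0.25 + 0.5 * (0.5 - v / 4.0)     # S2 branch

v = 1.0                                            # V0(x) = 1
for k in range(1, 11):
    v = F_mu(v)
    print(k, round(v, 6))
# The iterates decrease monotonically toward V*(0.5) = 0.5/0.875 ~ 0.5714.
```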

6. Conclusions

In this article, we extended policy iteration and value iteration algorithms to mappings that are merely weak contractions, broadening their applicability beyond traditional strong contractions such as Banach mappings. By leveraging Zamfirescu’s fixed-point theorem, we established sufficient conditions for the existence, uniqueness, and convergence of solutions in infinite-dimensional policy spaces. Our findings highlight the flexibility of fixed-point methods in reinforcement learning and dynamic programming, particularly for handling discontinuous operators. Our ongoing research aims to explore further generalizations, such as λ-policy iteration with randomization, stochastic extensions incorporating probabilistic transitions in Markov Decision Processes, and control engineering applications such as real-time adaptive control in robotics or traffic management systems, to enhance the practical impact of these theoretical advancements.

Author Contributions

Conceptualization, A.B. and R.C.; methodology, A.B.; validation, R.C.; formal analysis, A.B., R.C. and R.G.; investigation, A.B. and R.C.; writing—original draft preparation, A.B.; writing—review and editing, R.C. and R.G.; visualization, A.B. and R.C.; funding acquisition, R.C. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The datasets analyzed in this paper are not readily available due to technical limitations.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figure and Tables

Figure 1. Fμ at x = 0.5, showing piecewise behavior (blue for V ≤ 1; red for V > 1). Zamfirescu’s approach stabilizes at V* ≈ 0.571 (black dot), unlike Banach’s limitation.


Table 1. Summary of mathematical notations.

Symbol Description
X State space
U Control/action space
U(x) Admissible controls for state x ∈ X
M Policy space (set of admissible policies)
μ Policy (μ: X → U)
B(X) Banach space of bounded value functions
υ(X) Space of value functions (X → R)
ῡ(X) Extended value functions (X → R̄)
H(x, u, V) Cost-to-go function
Fμ Policy evaluation operator
F Bellman optimality operator
V*(x) Optimal value function
Vμ(x) Value function for policy μ
ν(x) Weighting function for norm definition
‖·‖ Weighted supremum norm
γ, δ Contraction/modulus parameters
α, β Zamfirescu contraction coefficients
S1, S2 Subsets of B(X) (Section 4.2)

Table 2. Convergence errors after 10 iterations at x = 0.5 (V* ≈ 0.571).

Approach Contraction Type Error |V10 − V*|
Zamfirescu Banach-type 0.00014
Zamfirescu Kannan-type 0.00016
Zamfirescu Chatterjea-type 0.00015
Banach Single (α = 0.9) 0.031

Appendix A. Analysis of Banach Contraction Failure in Robotic Navigation Example

To clarify why the standard Banach contraction principle fails for the policy evaluation operator Fμ in the robotic navigation example, we provide a rigorous analysis of its behavior across the Banach space B(X) = {V ∈ B(X): 0 < V(x) < 2} with the supremum norm ‖V‖ = sup_{x∈X} |V(x)|. The operator Fμ is defined piecewise as

FμV(x) = x² + 0.25 + 0.5·V(x)/4 if V(x) ∈ S1 = {V ∈ B(X): 0 ≤ V(x) ≤ 1}, and FμV(x) = x² + 0.25 + 0.5·(1/2 − V(x)/4) if V(x) ∈ S2 = {V ∈ B(X): 1 < V(x) ≤ 2}.

A Banach contraction requires the existence of a constant α ∈ [0, 1) such that

‖FμV − FμV′‖ ≤ α ‖V − V′‖ for all V, V′ ∈ B(X).

In what follows, we demonstrate why this condition fails due to the operator’s discontinuity at V(x) = 1.

Appendix A.1. Local Contractivity at the Boundary

We analyze the operator Fμ at x = 0.5 for V(0.5) = 1 ∈ S1 and V′(0.5) = 1 + ϵ ∈ S2:

FμV(0.5) = 0.5² + 0.25 + 0.5·(1/4) = 0.625,

FμV′(0.5) = 0.5² + 0.25 + 0.5·(1/2 − (1 + ϵ)/4) = 0.625 − 0.125ϵ,

|FμV(0.5) − FμV′(0.5)| / |V(0.5) − V′(0.5)| = 0.125ϵ/ϵ = 0.125.

This suggests local contractivity with α = 0.125 near the boundary V(x) = 1 at x = 0.5, as the operator contracts the distance between V and V′ locally.

Appendix A.2. Failure of Uniform Contractivity

Despite this local contractivity, the operator fails to satisfy the Banach contraction condition uniformly across B(X) due to its discontinuity at V(x) = 1. To illustrate, consider a discretized state space X = {x1, …, xN} (e.g., N = 100) and functions V, V′ ∈ B(X) such that V(xi) = 1 − ϵ ∈ S1 and V′(xi) = 1 + ϵ ∈ S2 for some xi, and V(xj) = V′(xj) for j ≠ i. The norm is ‖V − V′‖ = sup_{x∈X} |V(x) − V′(x)| = |(1 − ϵ) − (1 + ϵ)| = 2ϵ. Compute

FμV(xi) = xi² + 0.25 + 0.5·(1 − ϵ)/4 = xi² + 0.375 − 0.125ϵ,

FμV′(xi) = xi² + 0.25 + 0.5·(1/2 − (1 + ϵ)/4) = xi² + 0.375 − 0.125(1 + ϵ) = xi² + 0.25 − 0.125ϵ,

|FμV(xi) − FμV′(xi)| = |(xi² + 0.375 − 0.125ϵ) − (xi² + 0.25 − 0.125ϵ)| = 0.125,

‖FμV − FμV′‖ = sup_{x∈X} |FμV(x) − FμV′(x)| = 0.125,

‖FμV − FμV′‖ / ‖V − V′‖ = 0.125/(2ϵ) → ∞ as ϵ → 0.

This shows that, near the discontinuity at V(x) = 1, the contraction factor becomes arbitrarily large, violating the requirement for a uniform α < 1. The discontinuity causes the operator to amplify differences across the boundary, preventing a global contraction constant.
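The mechanism described above, a fixed-size jump in the operator output against an arbitrarily small change in the input, can be illustrated with a toy discontinuous scalar map; the map g below is our own illustrative construction, not the paper’s operator.

```python
# Toy illustration of why a jump discontinuity rules out a uniform Banach modulus:
# g has a jump of size 0.125 at v = 1, while being mildly contractive on each branch.
def g(v):
    return v / 4.0 if v <= 1.0 else v / 4.0 - 0.125

for eps in (1e-1, 1e-3, 1e-6):
    v, w = 1.0, 1.0 + eps
    ratio = abs(g(v) - g(w)) / abs(v - w)
    print(f"eps={eps:g}  |g(v)-g(w)|/|v-w| = {ratio:.3f}")
# The ratio grows without bound as eps -> 0, so no alpha < 1 can satisfy
# |g(v) - g(w)| <= alpha*|v - w| uniformly, exactly the failure mode analyzed above.
```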

References

1. Bellman, R. The theory of dynamic programming. Bull. Am. Math. Soc.; 1954; 60, pp. 503-515. [DOI: https://dx.doi.org/10.1090/S0002-9904-1954-09848-8]

2. Arutyunov, A.; Jaćimović, V.; Pereira, F. Second order necessary conditions for optimal impulsive control problems. J. Dyn. Control Syst.; 2003; 9, pp. 131-153. [DOI: https://dx.doi.org/10.1023/A:1022111402527]

3. Fraga, S.L.; Pereira, F.L. Hamilton-Jacobi-Bellman equation and feedback synthesis for impulsive control. IEEE Trans. Autom. Control; 2011; 57, pp. 244-249. [DOI: https://dx.doi.org/10.1109/TAC.2011.2167822]

4. Chertovskih, R.; Ribeiro, V.M.; Gonçalves, R.; Aguiar, A.P. Sixty Years of the Maximum Principle in Optimal Control: Historical Roots and Content Classification. Symmetry; 2024; 16, 1398. [DOI: https://dx.doi.org/10.3390/sym16101398]

5. Bertsekas, D.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, USA, 1996.

6. Bertsekas, D.P.; Ioffe, S. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming; Laboratory for Information and Decision Systems Report LIDS-P-2349 MIT: Cambridge, MA, USA, 1996; Volume 14.

7. Bertsekas, D.P. Proximal algorithms and temporal difference methods for solving fixed point problems. Comput. Optim. Appl.; 2018; 70, pp. 709-736. [DOI: https://dx.doi.org/10.1007/s10589-018-9990-5]

8. Li, Y.; Johansson, K.H.; Mårtensson, J. Lambda-policy iteration with randomization for contractive models with infinite policies: Well-posedness and convergence. Proceedings of the Learning for Dynamics and Control; Online, 10–11 June 2020; pp. 540-549.

9. Belhenniche, A.; Benahmed, S.; Pereira, F.L. Extension of λ-PIR for weakly contractive operators via fixed point theory. Fixed Point Theory; 2021; 22, pp. 511-526. [DOI: https://dx.doi.org/10.24193/fpt-ro.2021.2.34]

10. Banach, S. On operations in abstract sets and their application to integral equations. Fundam. Math.; 1922; 3, pp. 133-181. [DOI: https://dx.doi.org/10.4064/fm-3-1-133-181]

11. Kannan, R. Some results on fixed points—II. Am. Math. Mon.; 1969; 76, pp. 405-408.

12. Vetro, F. F-Contractions of Hardy–Rogers Type and Application to Multistage Decision. Nonlinear Anal. Model. Control; 2016; 21, pp. 531-546. [DOI: https://dx.doi.org/10.15388/NA.2016.4.7]

13. Chatterjea, S.K. Fixed point theorems for a sequence of mappings with contractive iterates. Publ. l’Institut Mathématique; 1972; 14, pp. 15-18.

14. Zamfirescu, T. Fix point theorems in metric spaces. Arch. Math.; 1972; 23, pp. 292-298. [DOI: https://dx.doi.org/10.1007/BF01304884]

15. Tang, W.; Tran, H.V.; Zhang, Y.P. Policy Iteration for the Deterministic Control Problems: A Viscosity Approach. SIAM J. Control Optim.; 2025; 63, pp. 375-401. [DOI: https://dx.doi.org/10.1137/24M1631602]

16. Kundu, S.; Kunisch, K. Policy iteration for Hamilton–Jacobi–Bellman equations with control constraints. Comput. Optim. Appl.; 2024; 87, pp. 785-809. [DOI: https://dx.doi.org/10.1007/s10589-021-00278-3]

17. Subrahmanyam, P.V. Completeness and fixed-points. Monatshefte Math.; 1975; 80, pp. 325-330. [DOI: https://dx.doi.org/10.1007/BF01472580]

18. Bertsekas, D. Abstract Dynamic Programming; Athena Scientific: Belmont, MA, USA, 2022.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).