1. Introduction

This is the first in a series on Deep Reinforcement Learning (DRL) inspired by recent kitchen counter side discussions with friends. An apt way to introduce the value and power of DRL is to contrast it with self-supervised pre-training in LLMs / DNNs. If we consider games such as chess, while it is possible to pre-train a model once on the rules of the game, far and away the best way for it to become adept is by doing so empirically – in DRL this is effected by providing an environment where the model may play the game and learn optimal strategies in a phenomenological way. The environment is engineered to deliver a reward signal to the model during intermediate and/or end states of the game. To somewhat simplify the narrative – the magnitude of the reward signal may then be used to construct a loss function that trains the model through standard methods from non-convex optimization. The power of Deep RL has been evidenced not just in the area of games (AlphaZero, AlphaGo) and robotics, but also in enhancing the reasoning capabilities of frontier LLMs on tasks with automatically verifiable rewards such as math problems and coding. This power is evidenced by a recent article in Nature on outcomes related to DeepSeek-R1 and DRL on Chain-of-Thought (CoT) fueled reasoning tasks – these outcomes are summed up by the following remark from the article: “The self-evolution of DeepSeek-R1-Zero underscores the power and beauty of RL: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives and it autonomously develops advanced problem-solving strategies. This serves as a reminder of the potential of RL to unlock higher levels of capabilities in LLMs, paving the way for more autonomous and adaptive models in the future” (DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning, Nature Oct. 2025).

To the refugee number theory acolyte, the confluence of the Banach Fixed Point Theorem and the theory of MDP strategies is pedestrian; such statements, along with applications of the fundamental theorem of Linear Programming, comprise the dual underpinnings of the de facto formal basis for tabular value-based methods derived from Bellman Optimality. Such value-based frameworks are subsequently used as approximation targets for value and policy networks in modern reinforcement learning (cf. later references to DeepSeek-R1 and Chain of Thought (CoT) driven reinforcement learning for generalized reasoning in LLMs). This is emphasized because these two theorems provide the underlying formal abstractions, and they are rarely placed front and center in most modern narratives on reinforcement learning.

Reinforcement Learning (RL) begins with a deceptively simple question: given that I am in some state at some point in time, what is the best strategy for maximizing present and future cumulative discounted rewards over all possible present and future actions from the given initial state? The answer is crystallized in Richard Bellman’s principle of optimality, which states that the search over an infinite continuum of stochastic policies collapses: there always exists a deterministic policy that is optimal – a result that arises by application of the fundamental theorem of linear programming. This article unpacks how and why such a “Stochastic Collapse” (self-termed) occurs. It then describes the algorithmic structure that lets us compute the answer by working backward from the terminal time (Bellman backup and dynamic programming), and why such exact methods shatter against the scale of real-world problems—from autonomous vehicles to the reasoning chains inside large language models like DeepSeek-R1. For refugees more at home in number theory or cryptography than in control theory, the basic intuition behind the stochastic collapse is this: optimizing a value function over all stochastic policies is, on its face, an optimization over an uncountably infinite set of probability distributions—a simplex at every state and every time step. The key insight is that the structure of the Bellman equation forces the optimum to a vertex of each simplex, and vertices of probability simplices are precisely the deterministic (pure) actions. In a related vein, the parallel to Nash’s theorem on equilibria in non-cooperative games is not accidental; both rely on the geometry of simplices and fixed-point arguments, though the mechanisms diverge in important ways we will discuss.

2. Setup: Finite-Horizon MDPs, Policies, and Value Functions

2.1 The Finite-Horizon MDP

Definition (Finite-Horizon MDP). A finite-horizon Markov Decision Process is a tuple \langle \mathcal{S}, \mathcal{A}, P, R, H, \gamma \rangle where:

  • \mathcal{S} is a finite set of states,
  • \mathcal{A} is a finite set of actions,
  • P_t(s' \mid s, a) is the transition probability from state s to s' under action a at time step t \in \{0, 1, \ldots, H-1\},
  • R_t(s,a) is the immediate reward for taking action a in state s at time step t,
  • H \in \mathbb{N} is the horizon—the number of decision epochs, and
  • \gamma \in (0, 1] is a discount factor.

The agent interacts with the MDP for exactly H steps: at each time t = 0, 1, \ldots, H-1, it observes state s_t \in \mathcal{S}, selects action a_t \in \mathcal{A}, receives reward R_t(s_t, a_t), and transitions to s_{t+1} \sim P_t(\cdot \mid s_t, a_t). The interaction terminates after step H-1.

Remark (Discount factor). In the finite-horizon setting, the discount factor \gamma may be set to 1 without convergence issues, since the sum of rewards is bounded by at most H terms. When \gamma = 1, the agent treats future rewards as equally valuable as immediate ones. We retain \gamma in the formulation for generality and to clarify the connection to the infinite-horizon discounted setting discussed in Section 5.

Remark (Time-varying dynamics). We allow the transition kernel P_t and reward function R_t to depend on t. This is the most general finite-horizon formulation. When P_t = P and R_t = R for all t (time-homogeneous dynamics), the model simplifies, but the optimal policy may still be non-stationary because the time remaining influences the optimal action.
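To make the tuple concrete, here is one minimal way such an MDP might be represented in code (the container class name and the array layout are illustrative choices, not a standard API):

```python
import numpy as np

# A minimal container for a finite-horizon MDP (all names illustrative).
# P[t] has shape (|S|, |A|, |S|): P[t][s, a, s'] = P_t(s' | s, a).
# R[t] has shape (|S|, |A|):      R[t][s, a]     = R_t(s, a).
class FiniteHorizonMDP:
    def __init__(self, P, R, gamma=1.0):
        self.P = P                  # list of H transition kernels
        self.R = R                  # list of H reward tables
        self.H = len(P)             # horizon = number of decision epochs
        self.gamma = gamma
        self.n_states, self.n_actions = R[0].shape

# A 2-state, 2-action, horizon-3 example with time-homogeneous dynamics
# (P_t = P and R_t = R for all t), so the same kernel is repeated H times.
P_t = np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]])
R_t = np.array([[1.0, 0.0],
                [0.0, 2.0]])
mdp = FiniteHorizonMDP(P=[P_t] * 3, R=[R_t] * 3, gamma=1.0)
```

Note that even with time-homogeneous dynamics, the list-of-kernels layout keeps the fully general time-varying formulation available.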

2.2 Non-Stationary Stochastic Policies

Definition (Non-stationary stochastic policy). A non-stationary stochastic policy is a sequence \pi = (\pi_0, \pi_1, \ldots, \pi_{H-1}) where each \pi_t : \mathcal{S} \to \Delta(\mathcal{A}) maps states to probability distributions over actions at time t. Here \Delta(\mathcal{A}) denotes the (|\mathcal{A}|-1)-dimensional probability simplex: \displaystyle \Delta(\mathcal{A}) = \left\{ \mathbf{p} \in \mathbb{R}^{|\mathcal{A}|} \ \Big|\ p_a \geq 0 \ \forall\, a \in \mathcal{A}, \quad \sum_{a \in \mathcal{A}} p_a = 1 \right\}

Definition (Non-stationary deterministic policy). A policy \pi is deterministic if, for every t and s, \pi_t(s) places all probability mass on a single action—equivalently, \pi_t(s) is a vertex of \Delta(\mathcal{A}).

The space of all non-stationary stochastic policies is the Cartesian product \displaystyle \Pi_{\mathrm{all}} = \prod_{t=0}^{H-1} \prod_{s \in \mathcal{S}} \Delta(\mathcal{A}), a continuous, uncountably infinite set of dimension H \cdot |\mathcal{S}| \cdot (|\mathcal{A}|-1).

2.3 The Gridworld Example

Example. Consider a gridworld state s with four actions: Up, Down, Left, Right. At time step t, a stochastic policy might assign \displaystyle \pi_t(s) = (0,\; 0.5,\; 0,\; 0.5), meaning: never go Up or Left, and flip a fair coin to decide between Down and Right. The space of all possible policies at this single state and time step is the 3-simplex—the tetrahedron in \mathbb{R}^4 whose vertices are the four pure actions (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1). At a different time step t', the policy at the same state s may assign an entirely different distribution—this is the non-stationary nature of finite-horizon policies. The number of remaining steps influences the optimal action: with many steps remaining, an exploratory detour may be worthwhile; with one step left, only the immediate reward matters.
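The coin-flip policy above is easy to check empirically; a short sketch (the action ordering and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["Up", "Down", "Left", "Right"]

# The stochastic policy pi_t(s) from the example: a point on the 3-simplex.
pi_t_s = np.array([0.0, 0.5, 0.0, 0.5])
assert np.isclose(pi_t_s.sum(), 1.0) and (pi_t_s >= 0).all()

# Acting under the policy means sampling an action from this distribution.
draws = rng.choice(len(actions), size=10_000, p=pi_t_s)
freqs = np.bincount(draws, minlength=4) / len(draws)
# Empirical frequencies approach (0, 0.5, 0, 0.5): Up and Left never occur.
```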

2.4 The State-Value Function

Under a fixed non-stationary policy \pi, the value of state s at time t is the expected discounted cumulative reward from s over the remaining horizon: \displaystyle V_t^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{H-1-t} \gamma^k \, R_{t+k}(s_{t+k}, a_{t+k}) \ \Big|\ s_t = s \right] At the terminal time t = H, no further decisions are made, so the boundary condition is: \displaystyle V_H^\pi(s) = 0 \qquad \forall\, s \in \mathcal{S} The optimal value function at time t is defined by optimizing over all non-stationary stochastic policies: \displaystyle V_t^*(s) = \sup_{\pi \in \Pi_{\mathrm{all}}} V_t^\pi(s) On its face, this is an optimization over the product of simplices—a space with H \cdot |\mathcal{S}| \cdot (|\mathcal{A}|-1) continuous dimensions. We now show that this optimization collapses dramatically.

3. The Stochastic Collapse: Why Deterministic Policies Suffice

3.1 The Bellman Optimality Equation (Finite Horizon)

The key structural result is the Bellman optimality equation for the finite-horizon setting. With the boundary condition V_H^*(s) = 0 for all s, the optimal value function satisfies, for t = H-1, H-2, \ldots, 0: \displaystyle V_t^*(s) = \max_{a \in \mathcal{A}} \left[ R_t(s,a) + \gamma \sum_{s' \in \mathcal{S}} P_t(s' \mid s, a)\, V_{t+1}^*(s') \right] Notice the operator is \max over the finite action set, not an optimization over probability distributions. This is not assumed—it is derived. We now show why.

3.2 The Linearity Argument

Define the action-value function (or Q-function) at time t under optimality: \displaystyle Q_t^*(s,a) = R_t(s,a) + \gamma \sum_{s' \in \mathcal{S}} P_t(s' \mid s, a)\, V_{t+1}^*(s') Under a stochastic policy \pi_t(\cdot \mid s) \in \Delta(\mathcal{A}), the value at state s and time t is the expectation over actions: \displaystyle V_t^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi_t(a \mid s)\, Q_t^*(s,a) This is a linear function of the policy weights \pi_t(\cdot \mid s) \in \Delta(\mathcal{A}). We are maximizing a linear functional over a convex polytope (the simplex). By a fundamental result in linear programming—indeed, by the same geometric argument that underlies the simplex method:

Proposition (Vertex optimality on the simplex). Let f(\mathbf{p}) = \sum_{a} p_a c_a be a linear function on the simplex \Delta(\mathcal{A}) with c_a \in \mathbb{R}. Then \displaystyle \max_{\mathbf{p} \in \Delta(\mathcal{A})} f(\mathbf{p}) = \max_{a \in \mathcal{A}} \; c_a, and the maximum is attained at the vertex \mathbf{e}_{a^*} where a^* = \arg\max_{a} c_a.

Proof. Since \mathbf{p} \in \Delta(\mathcal{A}), we have p_a \geq 0 and \sum_a p_a = 1. Let a^* = \arg\max_a c_a. Then \displaystyle \sum_a p_a c_a \leq \sum_a p_a c_{a^*} = c_{a^*} \cdot \sum_a p_a = c_{a^*}, with equality when p_{a^*} = 1. \square

Applying the Proposition with c_a = Q_t^*(s,a): \displaystyle \max_{\pi_t(\cdot|s) \in \Delta(\mathcal{A})} \sum_a \pi_t(a|s)\, Q_t^*(s,a) = \max_{a \in \mathcal{A}}\; Q_t^*(s,a) This holds at every state s and every time step t, independently. Therefore the optimal policy is deterministic and non-stationary: \displaystyle \pi_t^*(s) = \arg\max_{a \in \mathcal{A}}\; Q_t^*(s,a), \qquad t = 0, 1, \ldots, H-1
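The Proposition is easy to verify numerically: sample many points of the simplex and confirm that none beats the best vertex. A minimal sketch (the Q-values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Q-values for one (s, t): the coefficients c_a of the linear objective.
q = np.array([1.3, -0.2, 2.7, 0.8])

# Sample many stochastic policies (points of the simplex) via Dirichlet.
policies = rng.dirichlet(np.ones(len(q)), size=100_000)
values = policies @ q          # expected value under each stochastic policy

# No stochastic policy beats the best deterministic action, and the
# degenerate (one-hot) distribution on argmax attains the bound exactly.
assert values.max() <= q.max() + 1e-12
best_vertex = np.eye(len(q))[np.argmax(q)]
assert np.isclose(best_vertex @ q, q.max())
```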

3.3 The Collapse Summarized

Theorem (Deterministic optimality in finite-horizon MDPs). For any finite-horizon MDP \langle \mathcal{S}, \mathcal{A}, P, R, H, \gamma \rangle with \gamma \in (0,1], there exists a deterministic non-stationary policy \pi^* = (\pi_0^*, \pi_1^*, \ldots, \pi_{H-1}^*) that is optimal: \displaystyle V_t^{\pi^*}(s) = V_t^*(s) = \sup_{\pi \in \Pi_{\mathrm{all}}} V_t^\pi(s) \qquad \forall\, s \in \mathcal{S}, \quad t = 0, 1, \ldots, H Moreover, \pi_t^*(s) = \arg\max_{a \in \mathcal{A}} Q_t^*(s,a) for each t and s. This is the “stochastic collapse”: the optimization over \Pi_{\mathrm{all}}—an uncountable product of simplices with H \cdot |\mathcal{S}| \cdot (|\mathcal{A}|-1) continuous dimensions—reduces to H \cdot |\mathcal{S}| independent \arg\max operations, each over the finite set \mathcal{A}.

3.4 Computational Significance

Without the collapse, finding an optimal policy would require searching over the continuous manifold \Pi_{\mathrm{all}}. With the collapse, the search space reduces to |\mathcal{A}|^{H \cdot |\mathcal{S}|} deterministic non-stationary policies—a finite (though exponentially large) set. More importantly, as we show in the next section, backward induction avoids enumerating even this finite set.
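The gap between enumerating policies and running backward induction is easy to quantify; a back-of-the-envelope sketch with illustrative sizes:

```python
# Brute-force policy enumeration vs. backward induction for a modest
# tabular MDP (the sizes below are illustrative).
S, A, H = 10, 4, 20

num_policies = A ** (H * S)     # deterministic non-stationary policies
backup_ops = H * S**2 * A       # backup operations, up to a constant factor

print(f"{num_policies:.3e} policies vs. ~{backup_ops} backup operations")
# 4^200 ≈ 2.6e120 policies, but only 8,000 backup operations.
```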

4. Computing the Optimum: Backward Induction

The Bellman optimality equation, combined with the boundary condition V_H^*(s) = 0, yields a backward induction algorithm that computes V_t^* and \pi_t^* exactly.

Algorithm: Backward Induction (Finite-Horizon Value Iteration)

  1. Initialize: Set V_H(s) = 0 for all s \in \mathcal{S}.
  2. For t = H-1, H-2, \ldots, 0:
    1. For each s \in \mathcal{S} and a \in \mathcal{A}, compute: \displaystyle Q_t(s,a) = R_t(s,a) + \gamma \sum_{s' \in \mathcal{S}} P_t(s' \mid s,a)\, V_{t+1}(s')
    2. For each s \in \mathcal{S}, set: \displaystyle V_t(s) = \max_{a \in \mathcal{A}}\, Q_t(s,a), \qquad \pi_t^*(s) = \arg\max_{a \in \mathcal{A}}\, Q_t(s,a)
  3. Return \{V_t^*\}_{t=0}^{H} and \{\pi_t^*\}_{t=0}^{H-1}.

Theorem (Correctness of backward induction). Backward induction computes the optimal value function V_t^* and an optimal deterministic non-stationary policy \pi^* for any finite-horizon MDP. No initialization choice is required beyond the boundary condition V_H^* = 0, which is given by the problem definition.

Proof. By (backward) induction on t. The base case t = H is the boundary condition. For the inductive step, assume V_{t+1}^* has been computed correctly. Then Q_t^*(s,a) is correct, and by the Deterministic Optimality Theorem—itself a consequence of the Vertex Optimality Proposition—the \max over actions yields V_t^*(s). \square

Complexity. Each time step requires computing Q_t(s,a) for all |\mathcal{S}| \cdot |\mathcal{A}| state-action pairs, and each Q_t(s,a) involves a sum over |\mathcal{S}| successor states. The total complexity is: \displaystyle O\bigl(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|\bigr) This is polynomial in the tabular quantities H, |\mathcal{S}|, and |\mathcal{A}|—a dramatic improvement over enumerating |\mathcal{A}|^{H \cdot |\mathcal{S}|} policies.
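The algorithm fits in a few lines of NumPy. The following sketch (array layout and the toy two-state MDP are illustrative assumptions) also exhibits the non-stationarity noted earlier: with steps remaining, the optimal action in state 0 is a detour toward the higher-reward state; at the last step, it is the greedy immediate reward.

```python
import numpy as np

def backward_induction(P, R, gamma=1.0):
    """Exact finite-horizon value iteration (a sketch; names illustrative).

    P: list of H arrays of shape (S, A, S), P[t][s, a, s'] = P_t(s'|s, a)
    R: list of H arrays of shape (S, A),    R[t][s, a]     = R_t(s, a)
    Returns value tables V[t] and a deterministic non-stationary policy pi[t].
    """
    H = len(P)
    S, A = R[0].shape
    V = np.zeros((H + 1, S))            # boundary condition: V_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):      # t = H-1, H-2, ..., 0
        Q = R[t] + gamma * P[t] @ V[t + 1]   # Q_t(s, a), shape (S, A)
        V[t] = Q.max(axis=1)                 # max over the finite action set
        pi[t] = Q.argmax(axis=1)             # the deterministic argmax policy
    return V, pi

# Two states, two actions, horizon 3, time-homogeneous, deterministic moves.
P_t = np.array([[[1.0, 0.0], [0.0, 1.0]],    # state 0: a0 stays, a1 -> 1
                [[1.0, 0.0], [0.0, 1.0]]])   # state 1: a0 -> 0,  a1 stays
R_t = np.array([[1.0, 0.0],                  # staying in state 0 earns 1
                [0.0, 2.0]])                 # staying in state 1 earns 2
V, pi = backward_induction([P_t] * 3, [R_t] * 3, gamma=1.0)
```

Here pi[0][0] = 1 (at t = 0, detour from state 0 toward state 1) while pi[2][0] = 0 (at the final step, take the immediate reward), even though the dynamics are time-homogeneous.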

5. Connection to the Infinite-Horizon Discounted Setting

A parallel and widely studied formulation is the infinite-horizon discounted MDP \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle with \gamma \in [0,1), time-homogeneous dynamics, and no terminal time. Here H is absent from the tuple; instead, the strict discount \gamma < 1 ensures that the infinite sum of rewards converges. The value function under a stationary policy \pi: \mathcal{S} \to \Delta(\mathcal{A}) is: \displaystyle V^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\Big|\; s_0 = s \right] The Bellman optimality equation in this setting is: \displaystyle V^*(s) = \max_{a \in \mathcal{A}}\left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s') \right] and the same linearity argument (Vertex Optimality Proposition) guarantees the existence of a deterministic stationary optimal policy.

5.1 Value Iteration and the Banach Contraction Mapping Theorem

In the infinite-horizon setting, there is no terminal time from which to work backward. Instead, one defines the Bellman optimality operator T^* acting on value functions: \displaystyle (T^* V)(s) = \max_{a \in \mathcal{A}}\left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') \right] and applies it iteratively: V_{k+1} = T^* V_k, starting from an arbitrary initialization V_0. Convergence is guaranteed by the following classical result.

Theorem (Banach Fixed-Point Theorem). Let (X, d) be a complete metric space and T: X \to X a contraction mapping, i.e., there exists \gamma \in [0,1) such that \displaystyle d\bigl(T(x),\, T(y)\bigr) \leq \gamma\, d(x,y) \qquad \forall\, x, y \in X Then: (i) T has a unique fixed point x^* \in X; (ii) for any x_0 \in X, the sequence x_{k+1} = T(x_k) converges to x^*; (iii) the rate of convergence satisfies \displaystyle d(x_k, x^*) \leq \frac{\gamma^k}{1-\gamma}\, d(x_1, x_0)

The space of bounded value functions V: \mathcal{S} \to \mathbb{R} equipped with the supremum norm \|V\|_\infty = \max_s |V(s)| forms a complete metric space (a finite-dimensional Banach space). The operator T^* is a \gamma-contraction under this norm: \displaystyle \|T^* V - T^* U\|_\infty \leq \gamma\, \|V - U\|_\infty which follows from the non-expansiveness of the \max operator and the \gamma-scaling of the future terms. By the Banach theorem: (i) V^* is the unique fixed point of T^*; (ii) value iteration converges to V^* from any initialization V_0; (iii) convergence is geometric with rate \gamma, yielding an \epsilon-approximate solution in O\!\bigl(\frac{1}{1-\gamma}\log\frac{1}{\epsilon(1-\gamma)}\bigr) iterations. Each iteration has complexity O(|\mathcal{S}|^2 |\mathcal{A}|), giving a total complexity of: \displaystyle O\!\left( \frac{|\mathcal{S}|^2\,|\mathcal{A}|}{1-\gamma}\, \log\frac{1}{\epsilon(1-\gamma)} \right)

Remark (Finite vs. infinite horizon). The two settings are related but structurally distinct. In the finite-horizon case, convergence is trivial—backward induction terminates in exactly H steps from the known boundary V_H^* = 0. In the infinite-horizon case, there is no boundary condition; convergence must be proved, and the Banach theorem provides this guarantee. One can embed a finite-horizon problem into an infinite-horizon one by augmenting the state space with a time index and introducing an absorbing terminal state, but this is an encoding device rather than a natural subsumption.

An alternative exact method for the infinite-horizon setting is Policy Iteration (Howard, 1960), which alternates between policy evaluation (solving a |\mathcal{S}| \times |\mathcal{S}| linear system) and greedy policy improvement. It converges in finitely many iterations (at most |\mathcal{A}|^{|\mathcal{S}|}, though typically far fewer), with each iteration costing O(|\mathcal{S}|^3) for the linear solve.
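Value iteration under the contraction is equally compact; a sketch on a toy two-state chain with \gamma = 0.9 (the array layout and example dynamics are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman optimality operator T* to its unique fixed point.

    P: (S, A, S) transition kernel, R: (S, A) rewards, gamma in [0, 1).
    A sketch of infinite-horizon value iteration (names illustrative).
    """
    S, A = R.shape
    V = np.zeros(S)                            # any initialization works
    for _ in range(max_iter):
        V_new = (R + gamma * P @ V).max(axis=1)   # apply (T* V)(s)
        if np.abs(V_new - V).max() < tol:         # sup-norm stopping rule
            return V_new
        V = V_new
    return V

# Two states, two actions: in each state, a0 moves to state 0, a1 to state 1;
# staying in state 0 earns 1, staying in state 1 earns 2.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star = value_iteration(P, R, gamma=0.9)
```

From the zero initialization the iterates converge geometrically to the fixed point V*(1) = 2/(1-0.9) = 20 and V*(0) = 0.9 · V*(1) = 18 (moving to state 1 beats staying, since 1 + 0.9 · 18 = 17.2 < 18).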

6. Connection to Game-Theoretic Equilibria

The vertex-of-the-simplex phenomenon has a notable parallel in non-cooperative game theory. Nash’s Existence Theorem guarantees the existence of mixed-strategy equilibria in finite games via Kakutani’s fixed-point theorem applied to best-response correspondences over strategy simplices. In certain structured games—including two-player zero-sum games (the minimax theorem, von Neumann 1928)—the equilibrium may be achieved in pure strategies. The analogy is instructive but the mechanisms differ. In single-agent MDPs (both finite and infinite horizon), the optimality of deterministic policies follows from the linearity of the objective in the policy at each state and time step, combined with the simplex geometry. It is a consequence of LP duality, not a fixed-point existence argument. In multi-agent games, pure-strategy equilibria are not guaranteed in general. The minimax theorem for zero-sum games relies on saddle-point structure, and Nash equilibria in general games may require genuinely mixed strategies. The single-agent MDP is thus a more favorable setting: the linearity structure is stronger than what multi-agent interactions provide, and deterministic optima are guaranteed rather than merely possible.

7. The Curse of Dimensionality: Why Exact Methods Fail at Scale

Both backward induction and infinite-horizon value iteration are tabular methods: they maintain an explicit table of values for every state (and possibly every state-action pair) at every time step. Their per-iteration complexity is polynomial in |\mathcal{S}|—but |\mathcal{S}| itself is the problem. In real-world applications, the state space grows combinatorially or continuously:

Autonomous driving. The state includes the ego vehicle’s position, velocity, and heading; the positions and velocities of dozens of surrounding agents; lane geometry; traffic signal states; weather conditions; and sensor uncertainty. Even a coarse discretization yields state spaces of 10^{20} or more. The horizon H can span hundreds of decision steps at 10 Hz control frequency. Storing a single floating-point value per state per time step is physically impossible, let alone iterating over them. The complexity bound O(H \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|) becomes vacuous.

LLM reasoning (DeepSeek-R1 and GRPO). DeepSeek-R1 uses reinforcement learning—specifically Group Relative Policy Optimization—to train a large language model to perform multi-step reasoning. Here, a “state” at time t is the entire sequence of tokens generated so far (the partial chain of thought), and an “action” is the next token from a vocabulary of ~100,000. The state space is the set of all possible token sequences up to the context length—a number that dwarfs any astronomical quantity. The horizon H is the maximum generation length (often thousands of tokens). The value function (estimating the quality of a partial reasoning chain) cannot be tabulated; it must be approximated, and providing that approximation is exactly the role the neural network plays. In this setting, the non-stationary nature of the finite-horizon formulation is particularly natural: the quality of a token choice depends critically on how many reasoning steps remain.

7.1 The Approximation Paradigm

The modern response to the curse of dimensionality is function approximation: rather than storing V_t(s) or Q_t(s,a) in a table, represent them with a parameterized function—a neural network, a linear combination of features, or another compact representation: \displaystyle V_\theta(s, t) \approx V_t^*(s) This introduces both power and peril. Deep Q-Networks (DQN, Mnih et al., 2015) demonstrated that neural networks could approximate Q^* well enough to achieve superhuman Atari play directly from pixels. Proximal Policy Optimization (PPO, Schulman et al., 2017) and its descendants optimize parameterized stochastic policies directly via gradient ascent, sidestepping the need to ever represent V^* or Q^* in full. The clean convergence guarantees of backward induction (finite horizon) and the Banach theorem (infinite horizon) no longer apply in their original form—function approximation can cause divergence (the “deadly triad” of off-policy learning, function approximation, and bootstrapping). Ensuring stability in this regime remains one of the central open problems in RL theory.
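To make the idea of replacing the table with parameters concrete, here is a minimal semi-gradient TD sketch with a linear function approximator. The random feature map, step size, and training loop are illustrative assumptions; this is a toy stand-in for the paradigm, not the DQN or PPO architecture.

```python
import numpy as np

n_features = 8

def phi(s, a):
    """Hypothetical fixed random feature map for a state-action pair."""
    seed = hash((s, a)) % (2**32)            # deterministic within a run
    return np.random.default_rng(seed).standard_normal(n_features)

def td_update(theta, s, a, r, s_next, actions, gamma=0.9, alpha=0.02):
    """One semi-gradient Q-learning step on a transition (s, a, r, s')."""
    q = lambda st, ac: theta @ phi(st, ac)   # Q_theta(s, a) = theta . phi(s, a)
    target = r + gamma * max(q(s_next, b) for b in actions)
    td_error = target - q(s, a)
    # For a linear approximator, grad_theta Q_theta(s, a) = phi(s, a).
    return theta + alpha * td_error * phi(s, a), td_error

# Repeatedly fitting a single terminal transition (gamma = 0) drives
# Q_theta(0, 0) toward the observed reward of 1 without any table.
theta = np.zeros(n_features)
for _ in range(2000):
    theta, err = td_update(theta, 0, 0, 1.0, 1, actions=[0, 1], gamma=0.0)
```

The update bootstraps off the approximator’s own next-state estimate—the same ingredient that, combined with off-policy data and function approximation, produces the “deadly triad” instabilities discussed above.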

8. Conclusion

The essence of Bellman optimality is the dramatic simplification hidden inside an apparently intractable optimization problem. The scaling limitations of dynamic-programming-based methods motivate alternatives based on approximate rather than exact solutions – follow-on articles in this series will introduce approximate methods, starting with Deep Q-Learning, and then move on to a discussion of policy networks, starting with the so-called “REINFORCE trick” and culminating in a description of the PPO surrogate objective function.