Neptune Blog

Markov Decision Process in Reinforcement Learning: Everything You Need to Know

Andre Ye

7 min

13th September, 2024

ML Model Development Reinforcement Learning

Take a moment to locate the nearest big city around you. If you were to go there, how would you do it? Go by car, take a bus, take a train? Maybe ride a bike, or buy an airplane ticket?

Making this choice, you incorporate probability into your decision-making process. Perhaps there’s a 70% chance of rain or a car crash, which can cause traffic jams. If your bike tire is old, it may break down – this is certainly a large probabilistic factor.

On the other hand, there are deterministic costs – for instance, the cost of gas or an airplane ticket – as well as deterministic rewards – like much faster travel times taking an airplane.

These types of problems – in which an agent must balance probabilistic and deterministic rewards and costs – are common in decision-making. Markov Decision Processes are used to model these types of optimization problems, and can also be applied to more complex tasks in Reinforcement Learning.

Defining Markov Decision Processes in Machine Learning

To illustrate a Markov Decision process, think about a dice game:

Each round, you can either continue or quit.
If you quit, you receive $5 and the game ends.
If you continue, you receive $3 and roll a 6-sided die. If the die comes up as 1 or 2, the game ends. Otherwise, the game continues onto the next round.

There is a clear trade-off here. For one, we can trade a deterministic gain of $2 for the chance to roll dice and continue to the next round.

To create an MDP to model this game, first we need to define a few things:

A state is a status that the agent (decision-maker) can hold. In the dice game, the agent can either be in the game or out of the game.
An action is a movement the agent can choose. It moves the agent between states, with certain penalties or rewards.
Transition probabilities describe the probability of ending up in a state s’ (s prime) given an action a. These will be often denoted as a function P(s, a, s’) that outputs the probability of ending up in s’ given current state s and action a.
For example, P(s=playing the game, a=choose to continue playing, s’=not playing the game) is ⅓, since there is a two-sixths (one-third) chance of losing the dice roll.
Rewards are given depending on the action. The reward for continuing the game is 3, whereas the reward for quitting is $5. The ‘overall’ reward is to be optimized.

We can formally describe a Markov Decision Process as m = (S, A, P, R, gamma), where:

S represents the set of all states.
A represents the set of possible actions.
P represents the transition probabilities.
R represents the rewards.
Gamma is known as the discount factor (more on this later).

The goal of the MDP m is to find a policy, often denoted as pi, that yields the optimal long-term reward. Policies are simply a mapping of each state s to a distribution of actions a. For each state s, the agent should take action a with a certain probability. Alternatively, policies can also be deterministic (i.e. the agent will take action a in state s).

Our Markov Decision Process would look like the graph below. An agent traverses the graph’s two states by making decisions and following probabilities.

It’s important to mention the Markov Property, which applies not only to Markov Decision Processes but anything Markov-related (like a Markov Chain).

It states that the next state can be determined solely by the current state – no ‘memory’ is necessary. This applies to how the agent traverses the Markov Decision Process, but note that optimization methods use previous learning to fine tune policies. This is not a violation of the Markov property, which only applies to the traversal of an MDP.

The Bellman equation & dynamic programming

The Bellman Equation is central to Markov Decision Processes. It outlines a framework for determining the optimal expected reward at a state s by answering the question: “what is the maximum reward an agent can receive if they make the optimal action now and for all future decisions?”

Although versions of the Bellman Equation can become fairly complicated, fundamentally most of them can be boiled down to this form:

It is a relatively common-sense idea, put into formulaic terms. Notice the role gamma – which is between 0 or 1 (inclusive) – plays in determining the optimal reward. If gamma is set to 0, the V(s’) term is completely canceled out and the model only cares about the immediate reward.

On the other hand, if gamma is set to 1, the model weights potential future rewards just as much as it weights immediate rewards. The optimal value of gamma is usually somewhere between 0 and 1, such that the value of farther-out rewards has diminishing effects.

Let’s use the Bellman equation to determine how much money we could receive in the dice game. We can choose between two choices, so our expanded equation will look like max(choice 1’s reward, choice 2’s reward).

Choice 1 – quitting – yields a reward of 5.

On the other hand, choice 2 yields a reward of 3, plus a two-thirds chance of continuing to the next stage, in which the decision can be made again (we are calculating by expected return). We add a discount factor gamma in front of terms indicating the calculating of s’ (the next state).

This equation is recursive, but inevitably it will converge to one value, given that the value of the next iteration decreases by ⅔, even with a maximum gamma of 1.

At some point, it will not be profitable to continue staying in game. Let’s calculate four iterations of this, with a gamma of 1 to keep things simple and to calculate the total long-term optimal reward.

At each step, we can either quit and receive an extra $5 in expected value, or stay and receive an extra $3 in expected value. Each new round, the expected value is multiplied by two-thirds, since there is a two-thirds probability of continuing, even if the agent chooses to stay.

Here, the decimal values are computed, and we find that (with our current number of iterations) we can expect to get $7.8 if we follow the best choices.

Here, we calculated the best profit manually, which means there was an error in our calculation: we terminated our calculations after only four rounds.

If we were to continue computing expected values for several dozen more rows, we would find that the optimal value is actually higher. In order to compute this efficiently with a program, you would need to use a specialized data structure.

Plus, in order to be efficient, we don’t want to calculate each expected value independently, but in relation with previous ones. The solution: Dynamic Programming.

Richard Bellman, of the Bellman Equation, coined the term Dynamic Programming, and it’s used to compute problems that can be broken down into subproblems. For example, the expected value for choosing Stay > Stay > Stay > Quit can be found by calculating the value of Stay > Stay > Stay first.

These pre-computations would be stored in a two-dimensional array, where the row represents either the state [In] or [Out], and the column represents the iteration. We can write rules that relate each cell in the table to a previously precomputed cell (this diagram doesn’t include gamma).

Then, the solution is simply the largest value in the array after computing enough iterations. Through dynamic programming, computing the expected value – a key component of Markov Decision Processes and methods like Q-Learning – becomes efficient.

Q-learning: Markov Decision Process + Reinforcement Learning

Let’s think about a different simple game, in which the agent (the circle) must navigate a grid in order to maximize the rewards for a given number of iterations.

There are seven types of blocks:

-2 punishment,
-5 punishment,
-1 punishment,
+1 reward,
+10 reward,
block that moves the agent to space A1 or B3 with equal probability,
empty blocks.

Note that this is an MDP in grid form – there are 9 states and each connects to the state around it. The game terminates if the agent has a punishment of -5 or less, or if the agent has reward of 5 or more.

In Q-learning, we don’t know about probabilities – it isn’t explicitly defined in the model. Instead, the model must learn this and the landscape by itself by interacting with the environment.

This makes Q-learning suitable in scenarios where explicit probabilities and values are unknown. If they are known, then you might not need to use Q-learning.

In our game, we know the probabilities, rewards, and penalties because we are strictly defining them. But if, say, we are training a robot to navigate a complex landscape, we wouldn’t be able to hard-code the rules of physics; using Q-learning or another reinforcement learning method would be appropriate.

Each step of the way, the model will update its learnings in a Q-table. The table below, which stores possible state-action pairs, reflects current known information about the system, which will be used to drive future decisions.

Each of the cells contain Q-values, which represent the expected value of the system given the current action is taken. (Does this sound familiar? It should – this is the Bellman Equation again!)

All values in the table begin at 0 and are updated iteratively. Note that there is no state for A3 because the agent cannot control their movement from that point.

To update the Q-table, the agent begins by choosing an action. It cannot move up or down, but if it moves right, it suffers a penalty of -5, and the game terminates. The Q-table can be updated accordingly.

When the agent traverses the environment for the second time, it considers its options. Given the current Q-table, it can either move right or down. Moving right yields a loss of -5, compared to moving down, currently set at 0.

For the sake of simulation, let’s imagine that the agent travels along the path indicated below, and ends up at C1, terminating the game with a reward of 10. We can then fill in the reward that the agent received for each action they took along the way.

Obviously, this Q-table is incomplete. Even if the agent moves down from A1 to A2, there is no guarantee that it will receive a reward of 10. After enough iterations, the agent should have traversed the environment to the point where values in the Q-table tell us the best and worst decisions to make at every location.

This example is a simplification of how Q-values are actually updated, which involves the Bellman Equation discussed above. For instance, depending on the value of gamma, we may decide that recent information collected by the agent, based on a more recent and accurate Q-table, may be more important than old information, so we can discount the importance of older information in constructing our Q-table.

It’s important to note the exploration vs exploitation trade-off here. If the agent traverses the correct path towards the goal but ends up, for some reason, at an unlucky penalty, it will record that negative value in the Q-table and associate every move it took with this penalty.

If the agent is purely ‘exploitative’ – it always seeks to maximize direct immediate gain – it may never dare to take a step in the direction of that path.

Alternatively, if an agent follows the path to a small reward, a purely exploitative agent will simply follow that path every time and ignore any other path, since it leads to a reward that is larger than 1.

By allowing the agent to ‘explore’ more, it can focus less on choosing the optimal path to take and more on collecting information. This usually happens in the form of randomness, which allows the agent to have some sort of randomness in their decision process.

However, a purely ‘explorative’ agent is also useless and inefficient – it will take paths that clearly lead to large penalties and can take up valuable computing time.

It’s good practice to incorporate some intermediate mix of randomness, such that the agent bases its reasoning on previous discoveries, but still has opportunities to address less explored paths.

A sophisticated form of incorporating the exploration-exploitation trade-off is simulated annealing, which comes from metallurgy, the controlled heating and cooling of metals.

Instead of allowing the model to have some sort of fixed constant in choosing how explorative or exploitative it is, simulated annealing begins by having the agent heavily explore, then become more exploitative over time as it gets more information.

This method has shown enormous success in discrete problems like the Travelling Salesman Problem, so it also applies well to Markov Decision Processes.

Because simulated annealing begins with high exploration, it is able to generally gauge which solutions are promising and which are less so. As the model becomes more exploitative, it directs its attention towards the promising solution, eventually closing in on the most promising solution in a computationally efficient way.

Summary

Let’s wrap up what we explored in this article:

A Markov Decision Process (MDP) is used to model decisions that can have both probabilistic and deterministic rewards and punishments.

MDPs have five core elements:

S, a set of possible states for an agent to be in,
A, a set of possible actions an agent can take at a particular state,
R, the rewards for making an action A at state S;
P, the probabilities for transitioning to a new state S’ after taking action A at original state S;
gamma, which controls how far-looking the Markov Decision Process agent will be.

All Markov Processes, including MDPs, must follow the Markov Property, which states that the next state can be determined purely by the current state.

The Bellman Equation determines the maximum reward an agent can receive if they make the optimal decision at the current state and at all following states. It defines the value of the current state recursively as being the maximum possible value of the current state reward, plus the value of the next state.

Dynamic programming utilizes a grid structure to store previously computed values and builds upon them to compute new values. It can be used to efficiently calculate the value of a policy and to solve not only Markov Decision Processes, but many other recursive problems.

Q-Learning is the learning of Q-values in an environment, which often resembles a Markov Decision Process. It is suitable in cases where the specific probabilities, rewards, and penalties are not completely known, as the agent traverses the environment repeatedly to learn the best strategy by itself.

Hope you enjoyed exploring these topics with me. Thank you for reading!

Was the article useful?

More about Markov Decision Process in Reinforcement Learning: Everything You Need to Know

Check out our product resources and related articles below:

Reinforcement Learning From Human Feedback (RLHF) For LLMs

The Best Tools for Reinforcement Learning in Python You Actually Want to Try

LLMOps: What It Is, Why It Matters, and How to Implement It

LLMs For Structured Data

Explore more content topics:

Computer Vision General LLMOps ML Model Development ML Tools MLOps Natural Language Processing Paper Reflections Reinforcement Learning Tabular Data Time Series

Neptune is the experiment tracker purpose-built for foundation model training.

It lets you monitor and visualize thousands of per-layer metrics—losses, gradients, and activations—at any scale. Drill down into logs and debug training issues fast. Keep your model training stable while reducing wasted GPU cycles.

Play with a live project

See Docs

Transition Hub

Train FM

State of Foundation Model Training Report 2025