MLOps Blog

Reinforcement Learning Basics With Examples (Markov Chain and Tree Search)

12 min
10th August, 2023

Have you ever played against the computer in a video game, and wondered how it gets so good? Well, a big part of it is reinforcement learning.

Reinforcement Learning (RL) is a machine learning domain that focuses on building self-improving systems that learn for their own actions and experiences in an interactive environment. In RL, the system (learner) will learn what to do and how to do based on rewards. Unlike other machine learning algorithms, we don’t tell the system what to do. It autonomously explores and discovers which action can yield the most rewards. Reinforcement problems are considered a closed-loop because the system’s present actions will influence its later inputs. 

“Reinforcement Learning, in the context of machine learning and artificial intelligence, is a type of dynamic programming that trains algorithms using a system of reward and punishment.” – Techopedia

Read also

How to Structure, Organize, Track and Manage Reinforcement Learning (RL) Projects

In this article, we’re going to explore reinforcement learning in-depth along with some practical examples.

How does Reinforcement Learning work?

Reinforcement learning problems can be formulated with a sequence of different elements depending on the technique you’re using. A basic reinforcement learning process includes an agent, action taken by an agent in an interactive environment, and a reward based on the action. 

  • Agent – learner who takes decisions based on previously earned rewards.
  • Action – the step an agent takes in order to gain a reward.
  • Environment – a task which an agent needs to explore in order to get rewards.
  • State – in an environment, the state is a situation or position where an agent is present. The present state contains information about the previous state of the agent which helps them with the next course of action.
  • Rewards – an Agent receives rewards or punishments in response to the actions performed. 

In the above diagram, the subscripts t and t+1 denote the time steps. The agents interact with an environment in time steps, which get incremented as agents move to a new state:

st=s0 to st+1= s0+1=s1 

There’s no training phase in reinforcement learning, the learner or agent improves based on their own experiences, and receives rewards or punishments for positive and negative behaviors respectively. This whole experience is captured in state-action pairs that occur one after another. The ultimate goal of this process is to maximize overall reward while learning the best action or move. 

Though the goal is clear, there are still some parameters that reinforcement learning involves to achieve better performance.

  • Discount factor – helps to adjust the importance of rewards over time. It exponentially decreases the value of later rewards so agents don’t take any action with no long-term repercussions.  
  • Policy – indicates the mapping of states to actions. The goal is to find an optimal policy that specifies the action for each state, which promises the highest reward.
  • Value – determines how good a state-action pair is, i.e. this function attempts to find a policy that can help maximize the returns.
  • Q-value –  maps state-action pairs to rewards. This refers to the long-term impact of an action taken under a policy in a certain state.

Some of these functions are basic for all the RL algorithms, and some are specific to a few algorithms. In the next section, we’ll walk through RL algorithms and explain these functions in detail. 

Reinforcement learning algorithms are goal-oriented and learn how to achieve a complex goal. Basic algorithms might not be efficient for larger state-action spaces unless they’re incorporated with neural networks. 

Neural Networks consist of nodes that are interconnected, arranged in layers, and each connection between them will have some weights assigned to it. The objective is to learn about correct weights through several iterations of backward and forward propagation of training data. One might question how this can be applied to reinforcement learning algorithms.

May interest you

Guide to Building Your Own Neural Network
Graph Neural Network and Some of GNN Applications

In reinforcement learning there can be N number of combinations for state and actions, keeping records of all these combinations can be a bit difficult. A neural network can be trained on some state-action samples and learn to predict the best action possible for each state.

Now that we have a basic understanding of Reinforcement Learning, let’s review some of the algorithms and techniques.

Reinforcement Learning algorithms

We’ll only be discussing a few handpicked algorithms to get you started with Reinforcement Learning, as there are many of them and new ones keep being discovered.

RL algorithms can be categorized into:

  • Model-Based RL,
  • Model-Free RL. 

Model-free RL can be further divided into policy- and value-based RL.

  • Model-Based – These RL algorithms use models to learn the transition distribution based on previously observed transitions. Modeling is prone to be biased and the same can happen in Model-Based RL. The policy optimization or finding a correct state-action pair to get high rewards can face bias in case of insufficient data.
  • Policy-Based – In policy-based algorithms, the focus is on developing an optimal policy for the state-action pairs. There are two types of policies, deterministic and stochastic. 
  • Value-Based – A value-based algorithm determines an optimal policy to maximize the expected reward value over any and all successive steps, starting from the current state.

Let’s check out some of the most popular RL algorithms.


Q Learning is a model-free value-based Reinforcement Algorithm. The focus is on learning the value of an action in a particular state. Two main components help in finding correct action for a given state:

  1. Q-Table – This is a fancy name for a look-up table. The table contains the Q-score, the maximum expected future reward that the agent will get if it takes a specified action. Each row signifies a particular state in the environment and each column is dedicated to the actions. 
RL Q table

The table values will be updated iteratively with improved actions/scores. These values can be calculated using Q-function.

  1. Q-Function – Q-function uses Bellman’s equation. “Bellman equation, named after Richard E. Bellman, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming.

Q-function uses the basic concept of the Bellman equation. It calculates the values for a decision problem at particular points by using the values from the previous states. 

Q (st,at) = r(s,a) + max q (st,at)

In the above equation,

Q (st,at) = Q- value of the action given in a particular state 

r(s,a) = Reward for taking that action in a given state

= Discount factor

max q (st,at) = Maximum expected reward in a given state and all the possible actions at that state

Q in this equation represents the quality of the actions at each state, which is why in this algorithm a pair of state and action is being used. Q-value helps to identify which actions are more suitable than others in a state. This is where Bellman comes into the picture as we can leverage Bellman’s theory to identify the perfect Q-value.

Initially, in Q-learning, we explore the environment and update the Q-table until it’s ready and contains information about better actions for each state to maximize the rewards.

DQN – Deep Q Network

In the introduction section above, we mentioned incorporating neural networks with RL. DQN is the perfect example. DQN was developed by DeepMind in 2015 by leveraging Q-learning and Neural Networks.

In Q-learning, we deal with a discrete number of states and it’s easier to define and update the Q-table. In a big state space environment, it can be challenging. In DQN, instead of defining a Q-table, Neural networks are used to approximate the Q-value for each state-action pair. There are several challenges while applying neural networks in reinforcement learning, as most of the other deep learning applications use a good amount of hand-labeled training, which isn’t stable in the case of RL where reward data is sparse and noisy. 

RL data also keeps changing as an agent discovers new actions for a particular state. Using a neural network method that works only on fixed data distribution won’t help with RL. This is where DeepMind’s DQN algorithm suggested using a variant of the Q-learning method and experience replay function. 

A Q-Network can be trained by minimizing the loss function L as below:

The Loss function will change at each iteration i. In the above DQN equation, yi   is the target/reward calculation for iteration i and ρ(s, a) is a probability distribution for the sequences s and actions a, called behavior distribution.

Using consecutive samples to train the agent for learning purposes in RL isn’t useful, as samples will have a strong correlation. But, using experience replay, DQN randomizes the sampling and reduces the variance. In each iteration, we compute loss on mini-batches, which provides better data efficiency and helps to make an unbiased decision.

A3C – Asynchronous Advantage Actor-Critic

DQN uses experience replay, but also uses more memory and requires more computation. So, A3C suggests using multiple agents in parallel on multiple instances of the environment instead of experiencing replay. This technique provides a more stationary process as parallelism also helps in dealing with correlated data samples. 

In value-based model-free algorithms, the values of state-action pairs are updated based on Q-value, which makes the learning process slow. Through A3C, we can overcome DQN challenges. It’s a combination of three main components – 

  1. Asynchronous 

In DQN, a single neural network interacts with a single environment whereas, in A3C, parallel actor learners use multiple agents. Each agent will have its own network and environment and interact with it. This ultimately harvests the overall learning with the presence of a global network where agents will have more diverse data. 

  1. Actor Critic

A3C combines the good parts of both value-based and policy-based, so it’s able to predict both value and policy functions. The agents use the value predicted using the value function to find the optimal policy function. So, the value function acts as a Critic and policy as an Actor.

  1. Advantage

It’s a metric that helps to understand how much better the rewards were than expected. This improves the agent’s learning process in the environment. 

 A = Q(s, a) – V(s) 

This method has outperformed many other RL algorithms but it doesn’t mean other algorithms aren’t useful. But, in many cases, incorporating these together can help to achieve better efficiency. 

SARSA – State-action-reward-state-action

It was proposed by Rummery and Niranjan as Modified Connectionist Q-Learning(MCQ-L), but named SARSA by Rich Sutton. The name reflects the process we follow in SARSA. The purpose of this algorithm is to find the Q-value which depends on the current state and action of the agent in that state. It’s an on-policy algorithm that estimates the value of the policy based on action taken. 

Qt+1(St, At) = Qt(St, At) + αRt+1 + γQt(St+1, At+1) − Qt(St, At)

Here we consider transitions from state-action pair to state-action pair and learn the value of state-action pairs. This update is done after every transition from a nonterminal state St . If St+1 is terminal, then Qt(St+1, At+1) is defined as zero. This rule uses every element of the quintuple of events, (St, At, Rt+1, St+1, At+1), that make up a transition from one state-action pair to the next. – RL- Chapter 6.4

MBMF – Model Based Model Free

Evidence in neuroscience suggests that humans employ both MF and MB approaches for learning new skills, and switch between the two during the learning process. – MBMF RL 

MF algorithms are effective at learning complex policies, but it takes many trials and can be time-consuming where the model has to be accurate for MB to achieve a generalization. Both MB and MF have their limitations. By combining them, we can leverage their advantages. Model-based RL will use neural network models to create samples which will be utilized in Model free RL to achieve high rewards. Model-based algorithms can provide a supervised initialization of a policy that can be fine-tuned using model-free algorithms. 

Applications and use cases

Reinforcement learning has a wide range of applications. RL has aced the solutions for sequential decision-making problems. There are some non-sequential problems where researchers are leveraging RL techniques; RL is widely used in areas like gaming, healthcare, supply chain management, and more. 

  • Gaming

RL is known for its mainstream algorithms being used to solve several games with superhuman performance. AlphaGo is a computer program designed to play Go. It easily beat the human Go master. It combines the Monte Carlo Tree Search Algorithm with deep neural networks. RL is used in many other games, as well as video games, especially all the Atari Games. Agents with general intelligence have been built for PAC Man, Space Invaders, and more.

 “Reinforcement learning is much more than just an academic game. By enabling a computer to learn by itself with no hints and suggestions, the machine can act innovatively and overcome universal, human biases.” –

  • Robotics

Robotics systems are not new, and they usually perform repetitive actions in a controlled environment. There’s only a certain number of steps that need to be followed in a dedicated manner. 

Designing robots where they have to observe the surroundings and take the best course of action can be challenging, but deep learning and RL help make it possible. One good example of a robotics system can be a Robot Vacuum Cleaner in your house. At first, it won’t know your room and slowly it will learn about the area. Many such applications these days incorporate RL to make robots better.

  • Autonomous – Control Learning

Most people have heard about autonomous driving, like in Tesla cars. This area requires complex control architecture and parameter tuning is very difficult. But, once these systems are deployed, they’re mostly self-taught. 

Many things have to be considered while creating an autonomous system with RL. As you can see in the above diagram, the system/agent should be able to detect objects, plan the motion, and most importantly know when to move and when to stop. 

  • News Recommendation

Every day, tons of news content is generated, but not every news is relevant to every user. There are a lot of parameters that need to be taken into consideration. The traditional recommendation method covers all the parameters like location, time, profile, etc., but tends to recommend similar items. This decreases the reading choices and user satisfaction.

“Recommendation frameworks using RL can model future reward explicitly by considering the user return pattern as a supplement to click / no click label in order to capture more user feedback information. In addition, an effective exploration strategy is incorporated to find new attractive news for users.” – DRN – Pennsylvania State University

Learning models – hands-on

Markov Decision Process

Most Reinforcement Learning tasks can be framed as MDP. MDP is used to describe tasks where each event depends on the previous event, a property that’s called the Markov Property. This assumes that a future event of a process is solely based on the present state of that process or the whole history of the process. 

Learn more

Markov Decision Process in Reinforcement Learning: Everything You Need to Know

Typically, a Markov process consists of :

  1. An Agent that will perform actions,
  2. A set of possible routes or states,
  3. A decision environment,
  4. A goal that the agent has to achieve – reward.

We’ve already covered these components in detail in the above sections. So, here we’ll focus on creating an MDP environment from scratch and try to model a complex real-world problem this way.

For the MDP environment, we use the dynamic programming concept of breaking the problem into subproblems and store the solutions to these subproblems for later to solve the main problem. To divide the main problem into multiple subproblems, we’ll create an environment in grid form, where each grid will have different rewards or punishments. 

Let’s take a small example to understand what we’re talking about here. Say you’re driving and want to go home, but the road you usually take is busy today. There’s another road which is not busy and might get you home on time. Let’s see how an agent using RL can decide which way to go.

First, we define the problem and its states in a grid form.

HOME = "H"

mdp_grid = [
    [HOME, EMPTY],

for row in mdp_grid:
    print('|'+'|'.join(row) + '|')

The agent can only move left, right, up, and down which mean he can take only these actions:

UP = 0
DOWN = 1
LEFT = 2


Now that states and actions are known, it’s time to calculate the reward for each state-action pair and confirm if the goal has been achieved. For each iteration, it will take state and action as inputs and return a new state, action, and reward.

def path(state, action):

    def new_pos(state, action):
        p = deepcopy(state.agent_pos)
        if action == UP:
            p[0] = max(0, p[0] - 1)
        elif action == DOWN:
            p[0] = min(len(state.grid) - 1, p[0] + 1)
        elif action == LEFT:
            p[1] = max(0, p[1] - 1)
        elif action == RIGHT:
            p[1] = min(len(state.grid[0]) - 1, p[1] + 1)
            raise ValueError(f"Unknown action {action}")
        return p

    p = new_pos(state, action)
    grid_item = state.grid[p[0]][p[1]]

    new_grid = deepcopy(state.grid)

    if grid_item == TRAFFIC:
        reward = -10
        is_done = True
        new_grid[p[0]][p[1]] += AGENT
    elif grid_item == HOME:
        reward = 100
        is_done = True
        new_grid[p[0]][p[1]] += AGENT
    elif grid_item == EMPTY:
        reward = -1
        is_done = False
        old = state.agent_pos
        new_grid[old[0]][old[1]] = EMPTY
        new_grid[p[0]][p[1]] = AGENT
    elif grid_item == AGENT:
        reward = -1
        is_done = False
        raise ValueError(f"Unknown grid item {grid_item}")

    return State(grid=new_grid, agent_pos=p), reward, is_done

But how will we call this function and how will we decide that we’ve reached home? First, we will choose actions in each state randomly. Then, we’ll try to find if the path chosen is correct using a state-action pair. For each state-action pair, we’ll calculate the q-value to see which action is accurate to reach home, and then repeat this process until we are home.

Here, we’re defining some constants which we’ll use later to calculate the Q-value using the Bellman formula.



MIN_ALPHA = 0.02

alphas = np.linspace(1.0, MIN_ALPHA, N_EPISODES)
gamma = 1.0
eps = 0.2
start_state = State(grid=grid, car_pos=[1, 1])

q_table = dict()

To create a q_table, we need state-action and the respective q-value.

def q(state, action=None):

    if state not in q_table:
        q_table[state] = np.zeros(len(ACTIONS))

    if action is None:
        return q_table[state]

    return q_table[state][action]

We also need to decide what is the best action.

def choose_action(state):
    if random.uniform(0, 1) < eps:
        return random.choice(ACTIONS)
        return np.argmax(q(state))

Let’s play a few episodes and see how long it will take for the agent to decide the best way to home.

for e in range(N_EPISODES):

    state = start_state
    total_reward = 0
    alpha = alphas[e]

    for _ in range(MAX_EPISODE_STEPS):
        action = choose_action(state)
        next_state, reward, done = path(state, action)
        total_reward += reward

        q(state)[action] = q(state, action) +
                alpha * (reward + gamma *  np.max(q(next_state)) - q(state, action))
        state = next_state
        if done:
    print(f"Episode {e + 1}: total reward -> {total_reward}")

Now that our Q-table is ready, and the Agent is now aware of the best action possible for each state, let’s check how he can reach home step by step.

If you remember the grid we made earlier, the agent is in position (1,1). We’ll start the journey from 1,1 and see where the agent will go next – will he take the busy road or the road with no traffic?

r = q(start_state)
print(f"up={r[UP]}, down={r[DOWN]}, left={r[LEFT]}, right={r[RIGHT]}")

up=98.13421151608904, down=42.67994031503147, left=-10.0, right=42.25406377182159

So, you can see UP has the highest Q-value i.e. when the agent started, he took the road with no traffic. Let’s see what he’ll do next.

new_state, reward, done = path(start_state, UP)
r = q(new_state)
print(f"up={r[UP]}, down={r[DOWN]}, left={r[LEFT]}, right={r[RIGHT]}")

up=-1.0, down=0.9519170608828851, left=99.92190645654732, right=0.0

This time the agent turned left, which means he reached home successfully, and with that, we have just created our first RL solution. Here, we have created a simple Markov Decision Process from scratch. This problem will be more difficult if we add other factors like time, distance, etc. For more examples like this, you can check out the OpenAI github page

Monte Carlo Tree Search is a combination of classic tree search and reinforcement learning principles. This model is useful in combinatorial games where planning is required before making any move. In a chess game, before you move any piece, you think two or more steps ahead, running future scenarios in your head and thinking what moves your opponent can make further. The model is capable of combining different scenarios and generalizing them to find an optimal solution. 

A basic MCTS method is a simple search tree built node by node after simulated playouts. This process has 4 main steps:

  1. Selection

Using a specific strategy, the MCTS algorithm traverses the tree from root node R, recursively finds optimal child nodes, and (once the leaf node is reached) moves to the next step. MCST uses the UCB (Upper Confidence Bound) formula, which helps to keep the balance between exploration and exploitation trade-off. 

Here, Si is the value of a node i and, while traversing during the selection process, whichever child node returns the maximum value, will get selected. 

  1. Expansion

This is the process to add a new child node to the tree. This new node will be added to the previously selected node.

  1. Simulation

After expansion, the algorithm performs a simulation on a selected node. During this process, it looks into all the scenarios in the game until a result/goal is achieved.

  1. Backpropagation

Backpropagation is a step where all the previously visited nodes – from the current node to root node, will be updated with the reward value and number of visits. 

Now that you have a basic understanding of what MCTS is and its main steps, let’s build a tree search algorithm from scratch using Python. During this example, we’ll create a tic tac toe game and try to analyze the same step by step.

Create a tic tac board 3 x 3, at first, it will be empty.

def play_game():
    tree = MCTS()
    board = new_tic_tac_toe_board()
    while True:
        row_col = input("enter row,col: ")
        row, col = map(int, row_col.split(","))
        index = 3 * (row - 1) + (col - 1)
        if board.tup[index] is not None:
            raise RuntimeError("Invalid move")
        board = board.make_move(index)
        if board.terminal:
        for _ in range(50):
        board = tree.choose(board)
        if board.terminal:

def new_tic_tac_toe_board():
    return TicTacToeBoard(tup=(None,) * 9, turn=True, winner=None, terminal=False)
if __name__ == "__main__":
RL game 1
enter row,col: 3,1

Once you enter the row and column, your board will look like this.

RL game 2

Now it’s time for the computer to play his move. It will first do roll outs for 50 time and then make a move.

    def do_rollout(self, node):
        "Make the tree one layer better. (Train for one iteration.)"
        path = self._select(node)
        leaf = path[-1]
        reward = self._simulate(leaf)
        self._backpropagate(path, reward)

    def _select(self, node):
        "Find an unexplored descendent of `node`"
        path = []
        while True:
            if node not in self.children or not self.children[node]:
                return path
            unexplored = self.children[node] - self.children.keys()
            if unexplored:
                n = unexplored.pop()
                return path
            node = self._uct_select(node)

    def _expand(self, node):
        "Update the `children` dict with the children of `node`"
        if node in self.children:
        self.children[node] = node.find_children()

    def _simulate(self, node):
        "Returns the reward for a random simulation (to completion) of `node`"
        invert_reward = True
        while True:
            if node.is_terminal():
                reward = node.reward()
                return 1 - reward if invert_reward else reward
            node = node.find_random_child()
            invert_reward = not invert_reward

    def _backpropagate(self, path, reward):
        "Send the reward back up to the ancestors of the leaf"
        for node in reversed(path):
            self.N[node] += 1
            self.Q[node] += reward
            reward = 1 - reward

    def _uct_select(self, node):
        "Select a child of node, balancing exploration & exploitation"
        assert all(n in self.children for n in self.children[node])
        log_N_vertex = math.log(self.N[node])
        def uct(n):
            "Upper confidence bound for trees"
            return self.Q[n] / self.N[n] + self.exploration_weight * math.sqrt(
                log_N_vertex / self.N[n]

        return max(self.children[node], key=uct)

You can see in the do_rollout() function, it invokes all four main components of MCTS. Now, the opponent will play his turn and the board will look like this.

RL game 3

The same process will repeat until one of the players wins the game. If you’re interested in expanding the game by increasing the dimensions, here’s the basic code for your reference. 

This algorithm is used in many turn-based games: tic tac toe, chess, checkers, etc. It’s not limited to the gaming application, it can actually be expanded and used in decision processes as well.

Benefits & challenges in RL

Technology is one of those fields where no matter how many breakthroughs happen, there will always be room to evolve. While RL is one of the emerging fields in machine learning, it has some benefits over other algorithms and techniques, as well as some unique challenges.


  1. It doesn’t need labeled data for training purposes, there’s no cost involved to get the labeled training datasets.
  2. It’s goal-oriented and can be used for problems with a sequence of steps, unlike other ML algorithms which are input- and output-based.
  3. In RL, exploration, and exploitation both happen at the same time, which means RL doesn’t require retraining. While testing new approaches to finding new solutions, it also exploits the best solution so far.
  4. RL focuses on achieving long-term results which is difficult to achieve in other algorithms.
  5. RL systems are a reflection of how humans skill themselves, so these systems hardly require any interruptions and have a scope to achieve perfection.


  1. RL is not preferable to use for simpler problems.
  2. It can lead to an overload of states which can diminish the results.
  3. In most real-world problems there’s no concrete reward, but most of the RL algorithms work on the concept of looking for one specific goal.
  4. Though it’s all about trial and error in RL, it can’t be the same in all scenarios. For example, in autonomous vehicles, it can’t crash multiple times before it can decide which way to go.
  5. There are environment parameters in each algorithm and, from a definition point of view, it looks like we have control over these elements. In real-world scenarios, things can change frequently, and the accuracy of the RL system will always remain questionable. 

RL is different – supervised & unsupervised

RL is different from Supervised and Unsupervised learning, as it doesn’t need training data and works alone with the environment to achieve one defined goal. Here are some other differences:

Comparison Criteria
Supervised Learning
Unsupervised Learning
Reinforcement Learning


Learn using labeled Data

Unlabeled data for training

Interacts with the environment and follow trial and error process







No Supervision

No Supervision


Forecasting,Identity detection etc.

Fraud detection,Recommendation systems etc.

Autonomous driving, Gaming application, healthcare etc.


Throughout this article, we gathered some basic knowledge about Reinforcement Learning, from the definition to basic algorithms. We did some hands-on exercises to create Markov Decision Process and Monte Carlo Tree Search from scratch in Python. 

While writing this article and reading lots of research papers, I felt that there’s so much more to explore in this area, and a lot to experiment with. It’s exciting to see what the future will bring.

Thanks for reading!

Was the article useful?

Thank you for your feedback!