Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real-time.

A good example of this is self-driving cars, or when DeepMind built what we know today as AlphaGo, AlphaStar, and AlphaZero.

AlphaZero is a program built to master the games of chess, shogi and go (AlphaGo is the first program that beat a human Go master). AlphaStar plays the video game StarCraft II.

In this article, we’ll compare model-free vs model-based reinforcement learning. Along the way, we will explore:

- Fundamental concepts of Reinforcement Learning

a) Markov decision processes / Q-Value / Q-Learning / Deep Q Network - Difference between model-based and model-free reinforcement learning.
- Discrete mathematical approach to playing tennis – model-free reinforcement learning.
- Tennis game using Deep Q Network – model-based reinforcement learning.
- Comparison/Evaluation
- References to learn more

** ^{SEE RELATED ARTICLES}**👉 7 Applications of Reinforcement Learning in Finance and Trading

👉 10 Real-Life Applications of Reinforcement Learning

👉 Best Reinforcement Learning Tutorials, Examples, Projects, and Courses

## Fundamental concepts of Reinforcement Learning

Any reinforcement learning problem includes the following elements:

**Agent**– the program controlling the object of concern (for instance, a robot).**Environment**– this defines the outside world programmatically. Everything the agent(s) interacts with is part of the environment. It’s built for the agent to make it seem like a real-world case. It’s needed to prove the performance of an agent, meaning if it will do well once implemented in a real world application.**Rewards**– this gives us a score of how the algorithm performs with respect to the environment. It’s represented as 1 or 0. ‘1’ means that the policy network made the right move, ‘0’ means wrong move. In other words, rewards represent gains and losses.**Policy**– the algorithm used by the agent to decide its actions. This is the part that can be model-based or model-free.

Every problem that needs an RL solution starts with simulating an environment for the agent. Next, you build a policy network that guides the actions of the agent. The agent can then evaluate the policy if its corresponding action resulted in a gain or a loss.

The policy is our main discussion point for this article. Policy can be model-based or model-free. When building, our concern is how to optimize the policy network via policy gradient (PG).

PG algorithms directly try to optimize the policy to increase rewards. To understand these algorithms, we must take a look at Markov decision processes (MDP).

**Markov decision processes / Q-Value / Q-Learning / Deep Q Network**

MDP is a process with a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from state A to state B is fixed.

A lot of Reinforcement Learning problems with discrete actions are modeled as **Markov decision processes**, with the agent having no initial clue on the next transition state. The agent also has no idea on the rewarding principle, so it has to explore all possible states to begin to decode how to adjust to a perfect rewarding system. This will lead us to what we call Q Learning.

The **Q-Learning algorithm** is adapted from the **Q-Value** Iteration algorithm, in a situation where the agent has no prior knowledge of preferred states and rewarding principles. Q-Values can be defined as an optimal estimate of a state-action value in an MDP.

It is often said that Q-Learning doesn’t scale well to large (or even medium) MDPs with many states and actions. The solution is to approximate the Q-Value of any state-action pair (s,a). This is called Approximate Q-Learning.

DeepMind proposed the use of deep neural networks, which work much better, especially for complex problems – without the use of any feature engineering. A deep neural network used to estimate Q-Values is called a **deep Q-network (DQN).** Using DQN for approximated Q-learning is called Deep Q-Learning.

## Difference Between Model-Based and Model-Free Reinforcement Learning

In a model-based RL environment, the policy is based on the use of a machine learning model. To better understand RL Environments/Systems, what defines the system is the policy network. Knowing fully well that the policy is an algorithm that decides the action of an agent. In this case, when an RL environment or system utilizes the use of machine learning models like random forest, gradient boost, neural networks, and others, such an RL system is model-based. Moreover, this isn’t the case with a model-free RL, such a system has no policy based on the use of machine learning models; its policy is guided by the use of non-ML algorithms. For instance, trying to balance a lever system that ensures perfect stability of the effort and load while on the fulcrum (see fig 1). A simple algorithm can be written to basically, ensure that when load is greater than the effort, the fulcrum moves leftward and if otherwise, it moves right. This kind of system is, model-free because it requires no form of machine learning model to achieve stability.

To better understand this, we will explain everything with an example. In the example, we’ll build model-free and model-based RL for tennis games. To build the model, we need an environment for the policy to get implemented. However we won’t build the environment in this article, we’ll import one to use for our program.

## Pytennis environment

We’ll use the Pytennis environment to build a model-free and model-based RL system.

A tennis game requires the following:

- 2 players which implies 2 agents.
- A tennis lawn – main environment.
- A single tennis ball.
- Movement of the agents left-right (or right-left direction).

The Pytennis environment specifications are:

- There are 2 agents (2 players) with a ball.
- There’s a tennis field of dimension (x, y) – (300, 500)
- The ball was designed to move on a straight line, such that agent A decides a target point between x1 (0) and x2 (300) of side B (Agent B side), therefore it displays the ball 50 different times with respect to an FPS of 20. This makes the ball move in a straight line from source to destination. This also applies to agent B.
- Movement of Agent A and Agent B is bound between (x1= 100, to x2 = 600).
- Movement of the ball is bound along the y-axis (y1 = 100 to y2 = 600).
- Movement of the ball is bound along the x-axis (x1 = 100, to x2 = 600).

Pytennis is an environment that mimics real-life tennis situations. As shown below, the image on the left is a model-free Pytennis game, and the one on the right is model-based.

## Discrete mathematical approach to playing tennis – model-free Reinforcement Learning

Why “discrete mathematical approach to playing tennis”? Because this method is a logical implementation of the Pytennis environment.

The code below shows us the implementation of the ball movement on the lawn. You can find the source code here.

```
import time
import numpy as np
import pygame
import sys
#import seaborn as sns
from pygame.locals import *
pygame.init()
class Network:
def __init__(self, xmin, xmax, ymin, ymax):
"""
xmin: 150,
xmax: 450,
ymin: 100,
ymax: 600
"""
self.StaticDiscipline = {
'xmin': xmin,
'xmax': xmax,
'ymin': ymin,
'ymax': ymax
}
def network(self, xsource, ysource=100, Ynew=600, divisor=50): # ysource will always be 100
"""
For Network A
ysource: will always be 100
xsource: will always be between xmin and xmax (static discipline)
For Network B
ysource: will always be 600
xsource: will always be between xmin and xmax (static discipline)
"""
while True:
ListOfXsourceYSource = []
Xnew = np.random.choice([i for i in range(
self.StaticDiscipline['xmin'], self.StaticDiscipline['xmax'])], 1)
#Ynew = np.random.choice([i for i in range(self.StaticDiscipline['ymin'], self.StaticDiscipline['ymax'])], 1)
source = (xsource, ysource)
target = (Xnew[0], Ynew)
#Slope and intercept
slope = (ysource - Ynew)/(xsource - Xnew[0])
intercept = ysource - (slope*xsource)
if (slope != np.inf) and (intercept != np.inf):
break
else:
continue
#print(source, target)
# randomly select 50 new values along the slope between xsource and xnew (monotonically decreasing/increasing)
XNewList = [xsource]
if xsource < Xnew:
differences = Xnew[0] - xsource
increment = differences / divisor
newXval = xsource
for i in range(divisor):
newXval += increment
XNewList.append(int(newXval))
else:
differences = xsource - Xnew[0]
decrement = differences / divisor
newXval = xsource
for i in range(divisor):
newXval -= decrement
XNewList.append(int(newXval))
# determine the values of y, from the new values of x, using y= mx + c
yNewList = []
for i in XNewList:
findy = (slope * i) + intercept # y = mx + c
yNewList.append(int(findy))
ListOfXsourceYSource = [(x, y) for x, y in zip(XNewList, yNewList)]
return XNewList, yNewList
```

Here is how this works once the networks are initialized (Network A for Agent A and Network B for Agent B):

```
# Testing
net = Network(150, 450, 100, 600)
NetworkA = net.network(300, ysource=100, Ynew=600) # Network A
NetworkB = net.network(200, ysource=600, Ynew=100) # Network B
```

Each network is bounded by the directions of ball movement. Network A represents Agent A, which defines the movement of the ball from Agent A to any position between 100 and 300 along the x-axis at Agent B. This also applies to Network B (Agent B).

When the network is started, the .network method discretely generates 50 y-points (between y1 = 100 and y2 = 600), and corresponding x-points (between x1 which happens to be the location of the ball from Agent A to a randomly selected point x2 on Agent B side) for network A. This also applies to Network B (Agent B).

To automate the movement of each agent, the opposing agent has to move in a corresponding direction with respect to the ball. This can only be done by setting the x position of the ball to be the x position of the opposing agent, as in the code below.

```
playerax = ballx #When Agent A plays.
playerbx = ballx #When Agent B plays.
```

Meanwhile the source agent has to move back to its default position from its current position. The code below illustrates this.

```
def DefaultToPosition(x1, x2=300, divisor=50):
XNewList = []
if x1 < x2:
differences = x2 - x1
increment = differences / divisor
newXval = x1
for i in range(divisor):
newXval += increment
XNewList.append(int(np.floor(newXval)))
else:
differences = x1 - x2
decrement = differences / divisor
newXval = x1
for i in range(divisor):
newXval -= decrement
XNewList.append(int(np.floor(newXval)))
return XNewList
```

Now, to make the agents play with each other recursively, this has to run in a loop. After every 50 counts (50 frame display of the ball), the opposing player is made the next player. The code below puts all of it together in a loop.

```
def main():
while True:
display()
if nextplayer == 'A':
# playerA should play
if count == 0:
#playerax = lastxcoordinate
NetworkA = net.network(
lastxcoordinate, ysource=100, Ynew=600) # Network A
out = DefaultToPosition(lastxcoordinate)
# update lastxcoordinate
bally = NetworkA[1][count]
playerax = ballx #When Agent A plays.
count += 1
# soundObj = pygame.mixer.Sound('sound/sound.wav')
# soundObj.play()
# time.sleep(0.3)
# soundObj.stop()
else:
ballx = NetworkA[0][count]
bally = NetworkA[1][count]
playerbx = ballx
playerax = out[count]
count += 1
# let playerB play after 50 new coordinate of ball movement
if count == 49:
count = 0
nextplayer = 'B'
else:
nextplayer = 'A'
else:
# playerB can play
if count == 0:
#playerbx = lastxcoordinate
NetworkB = net.network(
lastxcoordinate, ysource=600, Ynew=100) # Network B
out = DefaultToPosition(lastxcoordinate)
# update lastxcoordinate
bally = NetworkB[1][count]
playerbx = ballx
count += 1
# soundObj = pygame.mixer.Sound('sound/sound.wav')
# soundObj.play()
# time.sleep(0.3)
# soundObj.stop()
else:
ballx = NetworkB[0][count]
bally = NetworkB[1][count]
playerbx = out[count]
playerax = ballx
count += 1
# update lastxcoordinate
# let playerA play after 50 new coordinate of ball movement
if count == 49:
count = 0
nextplayer = 'A'
else:
nextplayer = 'B'
# CHECK BALL MOVEMENT
DISPLAYSURF.blit(PLAYERA, (playerax, 50))
DISPLAYSURF.blit(PLAYERB, (playerbx, 600))
DISPLAYSURF.blit(ball, (ballx, bally))
# update last coordinate
lastxcoordinate = ballx
pygame.display.update()
fpsClock.tick(FPS)
for event in pygame.event.get():
if event.type == QUIT:
pygame.quit()
sys.exit()
return
```

And this is basic model-free reinforcement learning. It’s model-free because you need no form of learning or modelling for the 2 agents to play simultaneously and accurately.

## Tennis game using Deep Q Network – model-based Reinforcement Learning

A typical example of model-based reinforcement learning is the Deep Q Network. Source code to this work is available here.

The code below illustrates the Deep Q Network, which is the model architecture for this work.

```
from keras import Sequential, layers
from keras.optimizers import Adam
from keras.layers import Dense
from collections import deque
import numpy as np
class DQN:
def __init__(self):
self.learning_rate = 0.001
self.momentum = 0.95
self.eps_min = 0.1
self.eps_max = 1.0
self.eps_decay_steps = 2000000
self.replay_memory_size = 500
self.replay_memory = deque([], maxlen=self.replay_memory_size)
n_steps = 4000000 # total number of training steps
self.training_start = 10000 # start training after 10,000 game iterations
self.training_interval = 4 # run a training step every 4 game iterations
self.save_steps = 1000 # save the model every 1,000 training steps
self.copy_steps = 10000 # copy online DQN to target DQN every 10,000 training steps
self.discount_rate = 0.99
self.skip_start = 90 # Skip the start of every game (it's just waiting time).
self.batch_size = 100
self.iteration = 0 # game iterations
self.done = True # env needs to be reset
self.model = self.DQNmodel()
return
def DQNmodel(self):
model = Sequential()
model.add(Dense(64, input_shape=(1,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.learning_rate))
return model
def sample_memories(self, batch_size):
indices = np.random.permutation(len(self.replay_memory))[:batch_size]
cols = [[], [], [], [], []] # state, action, reward, next_state, continue
for idx in indices:
memory = self.replay_memory[idx]
for col, value in zip(cols, memory):
col.append(value)
cols = [np.array(col) for col in cols]
return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3],cols[4].reshape(-1, 1))
def epsilon_greedy(self, q_values, step):
self.epsilon = max(self.eps_min, self.eps_max - (self.eps_max-self.eps_min) * step/self.eps_decay_steps)
if np.random.rand() < self.epsilon:
return np.random.randint(10) # random action
else:
return np.argmax(q_values) # optimal action
```

In this case, we need a policy network to control the movement of each agent as they move along the x-axis. Since the values are continuous, that is from (x1 = 100 to x2 = 300), we can’t have a model that predicts or works with 200 states.

To simplify this problem, we can split x1 and x2 into 10 states / 10 actions, and define an upper and lower bound for each state.

**Note that we have 10 actions, because from a state there are 10 possibilities.**

The code below illustrates the definition of both upper and lower bounds for each state.

```
def evaluate_state_from_last_coordinate(self, c):
"""
cmax: 450
cmin: 150
c definately will be between 150 and 450.
state0 - (150 - 179)
state1 - (180 - 209)
state2 - (210 - 239)
state3 - (240 - 269)
state4 - (270 - 299)
state5 - (300 - 329)
state6 - (330 - 359)
state7 - (360 - 389)
state8 - (390 - 419)
state9 - (420 - 450)
"""
if c >= 150 and c <= 179:
return 0
elif c >= 180 and c <= 209:
return 1
elif c >= 210 and c <= 239:
return 2
elif c >= 240 and c <= 269:
return 3
elif c >= 270 and c <= 299:
return 4
elif c >= 300 and c <= 329:
return 5
elif c >= 330 and c <= 359:
return 6
elif c >= 360 and c <= 389:
return 7
elif c >= 390 and c <= 419:
return 8
elif c >= 420 and c <= 450:
return 9
```

The Deep Neural Network (DNN) used experimentally for this work is a network of 1 input (which represents the previous state), 2 hidden layers of 64 neurons each, and an output layer of 10 neurons (binary selection from 10 different states). This is shown below:

```
def DQNmodel(self):
model = Sequential()
model.add(Dense(64, input_shape=(1,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.learning_rate))
return model
```

Now that we have a DQN model that predicts the next state/action of the model, and the Pytennis environment already sorted out the ball movement in a straight line, let’s go ahead and write a function that carries out an action by an agent, based on the DQN model prediction regarding it’s next state.

The detailed code below illustrates how agent A makes a decision on where to direct the ball (on Agent B’s side and vice-versa). This code also evaluates agent B, if it was able to receive the ball.

```
def randomVal(self, action):
"""
cmax: 450
cmin: 150
c definately will be between 150 and 450.
state0 - (150 - 179)
state1 - (180 - 209)
state2 - (210 - 239)
state3 - (240 - 269)
state4 - (270 - 299)
state5 - (300 - 329)
state6 - (330 - 359)
state7 - (360 - 389)
state8 - (390 - 419)
state9 - (420 - 450)
"""
if action == 0:
val = np.random.choice([i for i in range(150, 180)])
elif action == 1:
val = np.random.choice([i for i in range(180, 210)])
elif action == 2:
val = np.random.choice([i for i in range(210, 240)])
elif action == 3:
val = np.random.choice([i for i in range(240, 270)])
elif action == 4:
val = np.random.choice([i for i in range(270, 300)])
elif action == 5:
val = np.random.choice([i for i in range(300, 330)])
elif action == 6:
val = np.random.choice([i for i in range(330, 360)])
elif action == 7:
val = np.random.choice([i for i in range(360, 390)])
elif action == 8:
val = np.random.choice([i for i in range(390, 420)])
else:
val = np.random.choice([i for i in range(420, 450)])
return val
def stepA(self, action, count=0):
# playerA should play
if count == 0:
self.NetworkA = self.net.network(
self.ballx, ysource=100, Ynew=600) # Network A
self.bally = self.NetworkA[1][count]
self.ballx = self.NetworkA[0][count]
if self.GeneralReward == True:
self.playerax = self.randomVal(action)
else:
self.playerax = self.ballx
# soundObj = pygame.mixer.Sound('sound/sound.wav')
# soundObj.play()
# time.sleep(0.4)
# soundObj.stop()
else:
self.ballx = self.NetworkA[0][count]
self.bally = self.NetworkA[1][count]
obsOne = self.evaluate_state_from_last_coordinate(
int(self.ballx)) # last state of the ball
obsTwo = self.evaluate_state_from_last_coordinate(
int(self.playerbx)) # evaluate player bx
diff = np.abs(self.ballx - self.playerbx)
obs = obsTwo
reward = self.evaluate_action(diff)
done = True
info = str(diff)
return obs, reward, done, info
def evaluate_action(self, diff):
if (int(diff) <= 30):
return True
else:
return False
```

From the code above, function stepA gets executed when AgentA has to play. While playing, AgentA uses the next action predicted by DQN to estimate the target (x2 position, at Agent B, from the current position of the ball, x1, which is on it’s own side), by using the ball trajectory network developed by the Pytennis environment to make its own move.

Agent A, for example, is able to get a precise point x2 on Agent’s B side by using the function **randomVal**, as shown above, to randomly select a coordinate x2 bounded by the action given by DQN.

Finally, function stepA evaluates the response of AgentB to target point x2 by using the function **evaluate_action**. The function **evaluate_action** defines if AgentB should be penalized or rewarded. Just as this is described for AgentA to AgentB, it applies for AgentB to AgentA (same code by different variable names).

Now that we have the policy, reward, environment, states and actions correctly defined, we can go ahead and recursively make the two agents play the game with each other.

The code below shows how turns are taken by each agent after 50 ball displays. Note that for each ball display, the DQN is making a decision on where to toss the ball for the next agent to play.

```
while iteration < iterations:
self.display()
self.randNumLabelA = self.myFontA.render(
'A (Win): '+str(self.updateRewardA) + ', A(loss): '+str(self.lossA), 1, self.BLACK)
self.randNumLabelB = self.myFontB.render(
'B (Win): '+str(self.updateRewardB) + ', B(loss): ' + str(self.lossB), 1, self.BLACK)
self.randNumLabelIter = self.myFontIter.render(
'Iterations: '+str(self.updateIter), 1, self.BLACK)
if nextplayer == 'A':
if count == 0:
# Online DQN evaluates what to do
q_valueA = self.AgentA.model.predict([stateA])
actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
# Online DQN plays
obsA, rewardA, doneA, infoA = self.stepA(
action=actionA, count=count)
next_stateA = actionA
# Let's memorize what just happened
self.AgentA.replay_memory.append(
(stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
stateA = next_stateA
elif count == 49:
# Online DQN evaluates what to do
q_valueA = self.AgentA.model.predict([stateA])
actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
obsA, rewardA, doneA, infoA = self.stepA(
action=actionA, count=count)
next_stateA = actionA
self.updateRewardA += rewardA
self.computeLossA(rewardA)
# Let's memorize what just happened
self.AgentA.replay_memory.append(
(stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
# restart the game if player A fails to get the ball, and let B start the game
if rewardA == 0:
self.restart = True
time.sleep(0.5)
nextplayer = 'B'
self.GeneralReward = False
else:
self.restart = False
self.GeneralReward = True
# Sample memories and use the target DQN to produce the target Q-Value
X_state_val, X_action_val, rewards, X_next_state_val, continues = (
self.AgentA.sample_memories(self.AgentA.batch_size))
next_q_values = self.AgentA.model.predict(
[X_next_state_val])
max_next_q_values = np.max(
next_q_values, axis=1, keepdims=True)
y_val = rewards + continues * self.AgentA.discount_rate * max_next_q_values
# Train the online DQN
self.AgentA.model.fit(X_state_val, tf.keras.utils.to_categorical(
X_next_state_val, num_classes=10), verbose=0)
nextplayer = 'B'
self.updateIter += 1
count = 0
# evaluate A
else:
# Online DQN evaluates what to do
q_valueA = self.AgentA.model.predict([stateA])
actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
# Online DQN plays
obsA, rewardA, doneA, infoA = self.stepA(
action=actionA, count=count)
next_stateA = actionA
# Let's memorize what just happened
self.AgentA.replay_memory.append(
(stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
stateA = next_stateA
if nextplayer == 'A':
count += 1
else:
count = 0
else:
if count == 0:
# Online DQN evaluates what to do
q_valueB = self.AgentB.model.predict([stateB])
actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
# Online DQN plays
obsB, rewardB, doneB, infoB = self.stepB(
action=actionB, count=count)
next_stateB = actionB
# Let's memorize what just happened
self.AgentB.replay_memory.append(
(stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
stateB = next_stateB
elif count == 49:
# Online DQN evaluates what to do
q_valueB = self.AgentB.model.predict([stateB])
actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
# Online DQN plays
obs, reward, done, info = self.stepB(
action=actionB, count=count)
next_stateB = actionB
# Let's memorize what just happened
self.AgentB.replay_memory.append(
(stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
stateB = next_stateB
self.updateRewardB += rewardB
self.computeLossB(rewardB)
# restart the game if player A fails to get the ball, and let B start the game
if rewardB == 0:
self.restart = True
time.sleep(0.5)
self.GeneralReward = False
nextplayer = 'A'
else:
self.restart = False
self.GeneralReward = True
# Sample memories and use the target DQN to produce the target Q-Value
X_state_val, X_action_val, rewards, X_next_state_val, continues = (
self.AgentB.sample_memories(self.AgentB.batch_size))
next_q_values = self.AgentB.model.predict(
[X_next_state_val])
max_next_q_values = np.max(
next_q_values, axis=1, keepdims=True)
y_val = rewards + continues * self.AgentB.discount_rate * max_next_q_values
# Train the online DQN
self.AgentB.model.fit(X_state_val, tf.keras.utils.to_categorical(
X_next_state_val, num_classes=10), verbose=0)
nextplayer = 'A'
self.updateIter += 1
# evaluate B
else:
# Online DQN evaluates what to do
q_valueB = self.AgentB.model.predict([stateB])
actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
# Online DQN plays
obsB, rewardB, doneB, infoB = self.stepB(
action=actionB, count=count)
next_stateB = actionB
# Let's memorize what just happened
self.AgentB.replay_memory.append(
(stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
tateB = next_stateB
if nextplayer == 'B':
count += 1
else:
count = 0
iteration += 1
```

## Comparison/Evaluation

Having played this game model-free and model-based, here are some differences that we need to be aware of:

s/n | Model-free | Model-based |
---|---|---|

1 | Rewards are not accounted for (since this is automated, reward = 1). |
Rewards are accounted for. |

2 | No modelling (no decision policy is required). |
Modelling is required (policy network). |

3 | This doesn’t require the use of initial states to predict the next state. |
This requires the use of initial states to predict the next state using the policy network. |

4 | The rate of missing the ball with respect to time is zero. |
The rate of missing the ball with respect to time approaches zero. |

If you’re interested, the videos below show these two techniques in action playing tennis games:

1. Model-free

2. Model-based

## Conclusion

Tennis might be simple compared to self-driving cars, but hopefully this example showed you a few things about RL that you didn’t know.

The main difference between model-free and model-based RL is the policy network, which is required for model-based RL and unnecessary in model-free.

It’s worth noting that oftentimes, model-based RL takes a massive amount of time for the DNN to learn the states perfectly without getting it wrong.

But every technique has its drawbacks and advantages, choosing the right one depends on what exactly you need your program to do.

Thanks for reading, I left a few additional references for you to follow if you want to explore this topic more.

## References

- AlphaGo documentary: https://www.youtube.com/watch?v=WXuK6gekU1Y
- List of reinforcement learning environments: https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f
- Create your own reinforcement learning environment: https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef
- Types of RL Environments: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment
- Model-based Deep Q Network: https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN
- Discrete mathematics approach youtube video: https://youtu.be/iUYxZ2tYKHw
- Deep Q Network approach YouTube video: https://youtu.be/FCwGNRiq9SY
- Model-free discrete mathematics implementation: https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-
- Hands-on Machine Learning with scikit-learn and TensorFlow: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

**READ NEXT**

## ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It

#### Jakub Czakon | Posted November 26, 2020

Let me share a story that I’ve heard too many times.

”… We were developing an ML model with my team, we ran a lot of experiments and got promising results…

…unfortunately, we couldn’t tell exactly what performed best because we forgot to save some model parameters and dataset versions…

…after a few weeks, we weren’t even sure what we have actually tried and we needed to re-run pretty much everything”– unfortunate ML researcher.

And the truth is, when you develop ML models you will run a lot of experiments.

Those experiments may:

- use different models and model hyperparameters
- use different training or evaluation data,
- run different code (including this small change that you wanted to test quickly)
- run the same code in a different environment (not knowing which PyTorch or Tensorflow version was installed)

And as a result, they can produce completely different evaluation metrics.

Keeping track of all that information can very quickly become really hard. Especially if you want to organize and compare those experiments and feel confident that you know which setup produced the best result.

This is where ML experiment tracking comes in.

Continue reading ->