Model-Based and Model-Free Reinforcement Learning: Pytennis Case Study
Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real-time.
Good examples of this are self-driving cars, or DeepMind’s systems that we know today as AlphaGo, AlphaStar, and AlphaZero.
AlphaZero is a program built to master the games of chess, shogi and go (AlphaGo is the first program that beat a human Go master). AlphaStar plays the video game StarCraft II.
In this article, we’ll compare model-free vs model-based reinforcement learning. Along the way, we will explore:
- Fundamental concepts of Reinforcement Learning (Markov decision processes / Q-Value / Q-Learning / Deep Q Network)
- Difference between model-based and model-free reinforcement learning
- Discrete mathematical approach to playing tennis – model-free reinforcement learning.
- Tennis game using Deep Q Network – model-based reinforcement learning.
- Comparison/Evaluation
- References to learn more
Fundamental concepts of Reinforcement Learning
Any reinforcement learning problem includes the following elements:
- Agent – the program controlling the object of concern (for instance, a robot).
- Environment – this defines the outside world programmatically. Everything the agent(s) interact with is part of the environment. It’s built to give the agent a realistic setting to act in, and to evaluate whether the agent would perform well once deployed in a real-world application.
- Rewards – these score how the algorithm performs with respect to the environment. In this setup the reward is binary: ‘1’ means the policy network made the right move, ‘0’ means it made the wrong one. In other words, rewards represent gains and losses.
- Policy – the algorithm used by the agent to decide its actions. This is the part that can be model-based or model-free.
Every problem that needs an RL solution starts with simulating an environment for the agent. Next, you build a policy network that guides the agent’s actions. The agent can then evaluate the policy based on whether each action resulted in a gain or a loss.
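To make these elements concrete, below is a minimal sketch of the interaction loop they form. The env and policy objects here are hypothetical placeholders (not the Pytennis environment), just to show how agent, environment, rewards, and policy fit together:

# Minimal agent-environment loop (hypothetical env and policy, for illustration only)
def run_episode(env, policy, max_steps=100):
    state = env.reset()                         # environment supplies the initial state
    total_reward = 0
    for _ in range(max_steps):
        action = policy(state)                  # the policy decides the agent's action
        state, reward, done = env.step(action)  # environment returns the outcome and a reward (e.g. 1 or 0)
        total_reward += reward                  # rewards accumulate into the agent's score
        if done:
            break
    return total_reward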
The policy is our main discussion point for this article. Policy can be model-based or model-free. When building, our concern is how to optimize the policy network via policy gradient (PG).
PG algorithms directly try to optimize the policy to increase rewards. To understand these algorithms, we must take a look at Markov decision processes (MDP).
Markov decision processes / Q-Value / Q-Learning / Deep Q Network
MDP is a process with a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from state A to state B is fixed.
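As a toy illustration, here is a minimal sketch of such a process with three states and fixed (made-up) transition probabilities; a full MDP additionally attaches actions and rewards to these transitions:

import numpy as np

# Toy Markov process: 3 states, fixed transition probabilities (illustrative numbers only)
P = np.array([
    [0.7, 0.2, 0.1],   # transition probabilities out of state 0
    [0.1, 0.8, 0.1],   # transition probabilities out of state 1
    [0.3, 0.3, 0.4],   # transition probabilities out of state 2
])

state = 0
for _ in range(10):
    state = np.random.choice(3, p=P[state])  # evolves randomly, but with fixed probabilities
    print(state)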
A lot of Reinforcement Learning problems with discrete actions are modeled as Markov decision processes, with the agent having no initial clue about the next transition state. The agent also has no idea of the rewarding principle, so it has to explore all possible states to begin to work out how to adapt to the reward system. This leads us to Q-Learning.
The Q-Learning algorithm is adapted from the Q-Value Iteration algorithm, in a situation where the agent has no prior knowledge of preferred states and rewarding principles. Q-Values can be defined as an optimal estimate of a state-action value in an MDP.
It is often said that Q-Learning doesn’t scale well to large (or even medium) MDPs with many states and actions. The solution is to approximate the Q-Value of any state-action pair (s,a). This is called Approximate Q-Learning.
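For reference, the tabular Q-Learning update that these approximate methods build on looks roughly like this (a generic sketch, not part of the Pytennis code); Approximate Q-Learning replaces the table with a parameterized function of (s, a):

import numpy as np

n_states, n_actions = 10, 10
Q = np.zeros((n_states, n_actions))   # table of Q-Values, one per state-action pair
alpha, gamma = 0.1, 0.99              # learning rate and discount rate

def q_learning_update(s, a, r, s_next):
    # Move Q(s, a) toward the observed reward plus the discounted best Q-Value of the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])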
DeepMind proposed the use of deep neural networks, which work much better, especially for complex problems – without any feature engineering. A deep neural network used to estimate Q-Values is called a Deep Q-Network (DQN). Using a DQN for approximate Q-Learning is called Deep Q-Learning.
Difference between model-based and model-free Reinforcement Learning
RL algorithms can be mainly divided into two categories – model-based and model-free.
Model-based, as it sounds, has an agent trying to understand its environment and creating a model of it based on its interactions with that environment. In such a system, preferences take priority over the consequences of the actions, i.e. the greedy agent will always try to perform the action it predicts will yield the maximum reward, irrespective of what that action may cause.
On the other hand, model-free algorithms seek to learn the consequences of their actions through experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words, such an algorithm will carry out an action multiple times and will adjust the policy (the strategy behind its actions) for optimal rewards, based on the outcomes.
Think of it this way: if the agent can predict the reward for an action before actually performing it, and can therefore plan what to do, the algorithm is model-based. If it needs to actually carry out the action to see what happens and learn from it, it is model-free.
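In pseudocode, the contrast looks roughly like this; both loops are generic sketches (the model object and its predict_reward/update methods are hypothetical, and the model-free branch reuses the q_learning_update sketch above):

# Model-free (sketch): act first, then learn from what actually happened.
def model_free_step(env, state, policy):
    action = policy(state)
    next_state, reward = env.step(action)              # must carry out the action to observe the outcome
    q_learning_update(state, action, reward, next_state)
    return next_state

# Model-based (sketch): use a learned model to predict outcomes before acting.
def model_based_step(env, model, state, actions):
    action = max(actions, key=lambda a: model.predict_reward(state, a))  # plan with predicted rewards
    next_state, reward = env.step(action)
    model.update(state, action, next_state, reward)    # refine the model from experience
    return next_state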
This leads to different applications for the two classes. For example, a model-based approach may be the perfect fit for playing chess or for a robotic arm on an assembly line, where the environment is static and getting the task done most efficiently is the main concern. However, in a real-world application such as a self-driving car, a model-based approach might prompt the car to run over a pedestrian to reach its destination in less time (maximum reward), while a model-free approach would make the car wait until the road is clear (the optimal way out).
To better understand this, we’ll work through an example: building model-free and model-based RL for a tennis game. To build the models, we need an environment for the policy to be implemented in. However, we won’t build the environment in this article; we’ll import one and use it in our program.
Pytennis environment
We’ll use the Pytennis environment to build a model-free and model-based RL system.
A tennis game requires the following:
- 2 players which implies 2 agents.
- A tennis lawn – main environment.
- A single tennis ball.
- Movement of the agents left-right (or right-left direction).
The Pytennis environment specifications are:
- There are 2 agents (2 players) with a ball.
- There’s a tennis field of dimension (x, y) – (300, 500)
- The ball moves in a straight line: agent A decides a target point between x1 (0) and x2 (300) on side B (Agent B’s side), and the ball is then displayed at 50 successive positions along that line at 20 FPS, so it travels in a straight line from source to destination. The same applies to agent B.
- Movement of Agent A and Agent B is bound between x1 = 100 and x2 = 600.
- Movement of the ball is bound along the y-axis (y1 = 100 to y2 = 600).
- Movement of the ball is bound along the x-axis (x1 = 100 to x2 = 600).
Pytennis is an environment that mimics real-life tennis situations. As shown below, the image on the left is a model-free Pytennis game, and the one on the right is model-based.


Discrete mathematical approach to playing tennis – model-free Reinforcement Learning
Why “discrete mathematical approach to playing tennis”? Because this method is a logical implementation of the Pytennis environment.
The code below shows us the implementation of the ball movement on the lawn. You can find the source code here.
import time
import numpy as np
import pygame
import sys
#import seaborn as sns
from pygame.locals import *
pygame.init()
class Network:
def __init__(self, xmin, xmax, ymin, ymax):
"""
xmin: 150,
xmax: 450,
ymin: 100,
ymax: 600
"""
self.StaticDiscipline = {
'xmin': xmin,
'xmax': xmax,
'ymin': ymin,
'ymax': ymax
}
def network(self, xsource, ysource=100, Ynew=600, divisor=50): # ysource will always be 100
"""
For Network A
ysource: will always be 100
xsource: will always be between xmin and xmax (static discipline)
For Network B
ysource: will always be 600
xsource: will always be between xmin and xmax (static discipline)
"""
while True:
ListOfXsourceYSource = []
Xnew = np.random.choice([i for i in range(
self.StaticDiscipline['xmin'], self.StaticDiscipline['xmax'])], 1)
#Ynew = np.random.choice([i for i in range(self.StaticDiscipline['ymin'], self.StaticDiscipline['ymax'])], 1)
source = (xsource, ysource)
target = (Xnew[0], Ynew)
#Slope and intercept
slope = (ysource - Ynew)/(xsource - Xnew[0])
intercept = ysource - (slope*xsource)
if (slope != np.inf) and (intercept != np.inf):
break
else:
continue
#print(source, target)
# randomly select 50 new values along the slope between xsource and xnew (monotonically decreasing/increasing)
XNewList = [xsource]
if xsource < Xnew[0]:
differences = Xnew[0] - xsource
increment = differences / divisor
newXval = xsource
for i in range(divisor):
newXval += increment
XNewList.append(int(newXval))
else:
differences = xsource - Xnew[0]
decrement = differences / divisor
newXval = xsource
for i in range(divisor):
newXval -= decrement
XNewList.append(int(newXval))
# determine the values of y, from the new values of x, using y= mx + c
yNewList = []
for i in XNewList:
findy = (slope * i) + intercept # y = mx + c
yNewList.append(int(findy))
ListOfXsourceYSource = [(x, y) for x, y in zip(XNewList, yNewList)]
return XNewList, yNewList
Here is how this works once the networks are initialized (Network A for Agent A and Network B for Agent B):
# Testing
net = Network(150, 450, 100, 600)
NetworkA = net.network(300, ysource=100, Ynew=600) # Network A
NetworkB = net.network(200, ysource=600, Ynew=100) # Network B
Each network is bounded by the direction of ball movement. Network A represents Agent A: it defines the movement of the ball from Agent A to any position between 150 and 450 along the x-axis on Agent B’s side. The same applies to Network B (Agent B).
When the network is started, the .network method discretely generates 50 y-points (between y1 = 100 and y2 = 600) and the corresponding x-points (between x1, the ball’s current location on Agent A’s side, and a randomly selected point x2 on Agent B’s side) for Network A. This also applies to Network B (Agent B).
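Assuming the Network class above, a quick check of one generated trajectory looks like this (the target point x2 is chosen randomly, so the exact numbers will differ from run to run):

net = Network(150, 450, 100, 600)
xs, ys = net.network(300, ysource=100, Ynew=600)  # Network A: ball leaves Agent A at x = 300
print(len(xs), len(ys))   # 51 coordinate pairs: the source point plus 50 steps
print(xs[:3], ys[:3])     # x drifts toward the random target while y moves from 100 toward 600
print(xs[-1], ys[-1])     # the last point lands near y = 600, on Agent B's side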
To automate the movement of each agent, the opposing agent has to move in step with the ball. This is done by setting the opposing agent’s x-position to the x-position of the ball, as in the code below.
playerbx = ballx # Agent B tracks the ball when Agent A plays.
playerax = ballx # Agent A tracks the ball when Agent B plays.
Meanwhile the source agent has to move back to its default position from its current position. The code below illustrates this.
def DefaultToPosition(x1, x2=300, divisor=50):
XNewList = []
if x1 < x2:
differences = x2 - x1
increment = differences / divisor
newXval = x1
for i in range(divisor):
newXval += increment
XNewList.append(int(np.floor(newXval)))
else:
differences = x1 - x2
decrement = differences / divisor
newXval = x1
for i in range(divisor):
newXval -= decrement
XNewList.append(int(np.floor(newXval)))
return XNewList
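For example, assuming the function above, starting from x1 = 450 it returns 50 x-values that walk the agent back toward the default position of 300:

path = DefaultToPosition(450)
print(len(path))           # 50 intermediate x-values
print(path[0], path[-1])   # roughly 447 ... 300: a steady walk back to the default position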
Now, to make the agents play with each other recursively, this has to run in a loop. After every 50 counts (50 frame displays of the ball), the opposing player becomes the next player. The code below puts it all together in a loop.
def main():
while True:
display()
if nextplayer == 'A':
# playerA should play
if count == 0:
#playerax = lastxcoordinate
NetworkA = net.network(
lastxcoordinate, ysource=100, Ynew=600) # Network A
out = DefaultToPosition(lastxcoordinate)
# update lastxcoordinate
bally = NetworkA[1][count]
playerax = ballx #When Agent A plays.
count += 1
# soundObj = pygame.mixer.Sound('sound/sound.wav')
# soundObj.play()
# time.sleep(0.3)
# soundObj.stop()
else:
ballx = NetworkA[0][count]
bally = NetworkA[1][count]
playerbx = ballx
playerax = out[count]
count += 1
# let playerB play after 50 new coordinate of ball movement
if count == 49:
count = 0
nextplayer = 'B'
else:
nextplayer = 'A'
else:
# playerB can play
if count == 0:
#playerbx = lastxcoordinate
NetworkB = net.network(
lastxcoordinate, ysource=600, Ynew=100) # Network B
out = DefaultToPosition(lastxcoordinate)
# update lastxcoordinate
bally = NetworkB[1][count]
playerbx = ballx
count += 1
# soundObj = pygame.mixer.Sound('sound/sound.wav')
# soundObj.play()
# time.sleep(0.3)
# soundObj.stop()
else:
ballx = NetworkB[0][count]
bally = NetworkB[1][count]
playerbx = out[count]
playerax = ballx
count += 1
# update lastxcoordinate
# let playerA play after 50 new coordinate of ball movement
if count == 49:
count = 0
nextplayer = 'A'
else:
nextplayer = 'B'
# CHECK BALL MOVEMENT
DISPLAYSURF.blit(PLAYERA, (playerax, 50))
DISPLAYSURF.blit(PLAYERB, (playerbx, 600))
DISPLAYSURF.blit(ball, (ballx, bally))
# update last coordinate
lastxcoordinate = ballx
pygame.display.update()
fpsClock.tick(FPS)
for event in pygame.event.get():
if event.type == QUIT:
pygame.quit()
sys.exit()
return
And this is basic model-free reinforcement learning. It’s model-free because you need no form of learning or modelling for the 2 agents to play simultaneously and accurately.
Tennis game using Deep Q Network – model-based Reinforcement Learning
A typical example of model-based reinforcement learning is the Deep Q Network. Source code to this work is available here.
The code below illustrates the Deep Q Network, which is the model architecture for this work.
from keras import Sequential, layers
from keras.optimizers import Adam
from keras.layers import Dense
from collections import deque
import numpy as np
class DQN:
def __init__(self):
self.learning_rate = 0.001
self.momentum = 0.95
self.eps_min = 0.1
self.eps_max = 1.0
self.eps_decay_steps = 2000000
self.replay_memory_size = 500
self.replay_memory = deque([], maxlen=self.replay_memory_size)
self.n_steps = 4000000 # total number of training steps
self.training_start = 10000 # start training after 10,000 game iterations
self.training_interval = 4 # run a training step every 4 game iterations
self.save_steps = 1000 # save the model every 1,000 training steps
self.copy_steps = 10000 # copy online DQN to target DQN every 10,000 training steps
self.discount_rate = 0.99
self.skip_start = 90 # Skip the start of every game (it's just waiting time).
self.batch_size = 100
self.iteration = 0 # game iterations
self.done = True # env needs to be reset
self.model = self.DQNmodel()
return
def DQNmodel(self):
model = Sequential()
model.add(Dense(64, input_shape=(1,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.learning_rate))
return model
def sample_memories(self, batch_size):
indices = np.random.permutation(len(self.replay_memory))[:batch_size]
cols = [[], [], [], [], []] # state, action, reward, next_state, continue
for idx in indices:
memory = self.replay_memory[idx]
for col, value in zip(cols, memory):
col.append(value)
cols = [np.array(col) for col in cols]
return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3],cols[4].reshape(-1, 1))
def epsilon_greedy(self, q_values, step):
self.epsilon = max(self.eps_min, self.eps_max - (self.eps_max-self.eps_min) * step/self.eps_decay_steps)
if np.random.rand() < self.epsilon:
return np.random.randint(10) # random action
else:
return np.argmax(q_values) # optimal action
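Assuming the DQN class above (and the imports that come with it), instantiating an agent and querying it for an action looks roughly like this. At step 0 epsilon is 1.0, so the action is always random; exploitation only kicks in as the step count grows:

agent = DQN()
state = np.array([[3]])                           # previous state, shaped (batch, 1) for the Dense input
q_values = agent.model.predict(state)             # 10 (untrained) Q-Value estimates for this state
action = agent.epsilon_greedy(q_values, step=0)   # epsilon-greedy: random here, since epsilon = 1.0
print(action)                                     # an integer between 0 and 9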
In this case, we need a policy network to control the movement of each agent along the x-axis. Since the x-position is effectively continuous (roughly x1 = 150 to x2 = 450, per the bounds used in the code), we can’t have a model that predicts or works with ~300 separate states.
To simplify this problem, we can split that range into 10 states and define an upper and lower bound for each state.
Note that we also have 10 actions, because from any state there are 10 possible target states.
The code below illustrates the definition of both upper and lower bounds for each state.
def evaluate_state_from_last_coordinate(self, c):
"""
cmax: 450
cmin: 150
c definately will be between 150 and 450.
state0 - (150 - 179)
state1 - (180 - 209)
state2 - (210 - 239)
state3 - (240 - 269)
state4 - (270 - 299)
state5 - (300 - 329)
state6 - (330 - 359)
state7 - (360 - 389)
state8 - (390 - 419)
state9 - (420 - 450)
"""
if c >= 150 and c <= 179:
return 0
elif c >= 180 and c <= 209:
return 1
elif c >= 210 and c <= 239:
return 2
elif c >= 240 and c <= 269:
return 3
elif c >= 270 and c <= 299:
return 4
elif c >= 300 and c <= 329:
return 5
elif c >= 330 and c <= 359:
return 6
elif c >= 360 and c <= 389:
return 7
elif c >= 390 and c <= 419:
return 8
elif c >= 420 and c <= 450:
return 9
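Since every bucket spans 30 pixels, the same mapping can be written as a one-line calculation (an equivalent sketch of the chain of conditions above):

def state_from_x(c):
    # e.g. c = 305 -> (305 - 150) // 30 = 5, and c = 450 is clipped into state 9
    return min((c - 150) // 30, 9)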
The Deep Neural Network (DNN) used experimentally for this work has 1 input (which represents the previous state), 2 hidden layers of 64 neurons each, and an output layer of 10 neurons (one per possible state/action). This is shown below:
def DQNmodel(self):
model = Sequential()
model.add(Dense(64, input_shape=(1,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=self.learning_rate))
return model
Now that we have a DQN model that predicts the next state/action, and the Pytennis environment already handles moving the ball in a straight line, let’s write a function that carries out an agent’s action based on the DQN model’s prediction of its next state.
The code below illustrates how Agent A decides where to direct the ball on Agent B’s side (and vice versa). It also evaluates whether Agent B was able to receive the ball.
def randomVal(self, action):
"""
cmax: 450
cmin: 150
c definately will be between 150 and 450.
state0 - (150 - 179)
state1 - (180 - 209)
state2 - (210 - 239)
state3 - (240 - 269)
state4 - (270 - 299)
state5 - (300 - 329)
state6 - (330 - 359)
state7 - (360 - 389)
state8 - (390 - 419)
state9 - (420 - 450)
"""
if action == 0:
val = np.random.choice([i for i in range(150, 180)])
elif action == 1:
val = np.random.choice([i for i in range(180, 210)])
elif action == 2:
val = np.random.choice([i for i in range(210, 240)])
elif action == 3:
val = np.random.choice([i for i in range(240, 270)])
elif action == 4:
val = np.random.choice([i for i in range(270, 300)])
elif action == 5:
val = np.random.choice([i for i in range(300, 330)])
elif action == 6:
val = np.random.choice([i for i in range(330, 360)])
elif action == 7:
val = np.random.choice([i for i in range(360, 390)])
elif action == 8:
val = np.random.choice([i for i in range(390, 420)])
else:
val = np.random.choice([i for i in range(420, 450)])
return val
def stepA(self, action, count=0):
# playerA should play
if count == 0:
self.NetworkA = self.net.network(
self.ballx, ysource=100, Ynew=600) # Network A
self.bally = self.NetworkA[1][count]
self.ballx = self.NetworkA[0][count]
if self.GeneralReward == True:
self.playerax = self.randomVal(action)
else:
self.playerax = self.ballx
# soundObj = pygame.mixer.Sound('sound/sound.wav')
# soundObj.play()
# time.sleep(0.4)
# soundObj.stop()
else:
self.ballx = self.NetworkA[0][count]
self.bally = self.NetworkA[1][count]
obsOne = self.evaluate_state_from_last_coordinate(
int(self.ballx)) # last state of the ball
obsTwo = self.evaluate_state_from_last_coordinate(
int(self.playerbx)) # evaluate player bx
diff = np.abs(self.ballx - self.playerbx)
obs = obsTwo
reward = self.evaluate_action(diff)
done = True
info = str(diff)
return obs, reward, done, info
def evaluate_action(self, diff):
if (int(diff) <= 30):
return True
else:
return False
From the code above, the stepA function is executed when Agent A has to play. While playing, Agent A uses the next action predicted by the DQN to pick a target (an x2 position on Agent B’s side, reached from the ball’s current position x1 on its own side), and then uses the ball-trajectory network provided by the Pytennis environment to make its move.
Agent A, for example, obtains a precise point x2 on Agent B’s side by using the randomVal function shown above to randomly select a coordinate x2 within the bounds of the action chosen by the DQN.
Finally, stepA evaluates Agent B’s response to the target point x2 using the evaluate_action function, which decides whether Agent B should be penalized or rewarded. Everything described here for Agent A playing toward Agent B applies equally in the other direction (the same code with different variable names).
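Concretely, the reward boils down to whether the receiving agent ends up within 30 pixels (one state-width) of the ball, e.g.:

# Same rule as evaluate_action above: reward is True when the receiver is close enough to the ball.
print(abs(310 - 295) <= 30)   # True  - Agent B reached the ball (reward)
print(abs(430 - 180) <= 30)   # False - Agent B missed by a wide margin (counted as a loss)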
Now that we have the policy, reward, environment, states and actions correctly defined, we can go ahead and recursively make the two agents play the game with each other.
The code below shows how turns are taken by each agent after 50 ball displays. Note that for each ball display, the DQN is making a decision on where to toss the ball for the next agent to play.
while iteration < iterations:
self.display()
self.randNumLabelA = self.myFontA.render(
'A (Win): '+str(self.updateRewardA) + ', A(loss): '+str(self.lossA), 1, self.BLACK)
self.randNumLabelB = self.myFontB.render(
'B (Win): '+str(self.updateRewardB) + ', B(loss): ' + str(self.lossB), 1, self.BLACK)
self.randNumLabelIter = self.myFontIter.render(
'Iterations: '+str(self.updateIter), 1, self.BLACK)
if nextplayer == 'A':
if count == 0:
# Online DQN evaluates what to do
q_valueA = self.AgentA.model.predict([stateA])
actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
# Online DQN plays
obsA, rewardA, doneA, infoA = self.stepA(
action=actionA, count=count)
next_stateA = actionA
# Let's memorize what just happened
self.AgentA.replay_memory.append(
(stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
stateA = next_stateA
elif count == 49:
# Online DQN evaluates what to do
q_valueA = self.AgentA.model.predict([stateA])
actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
obsA, rewardA, doneA, infoA = self.stepA(
action=actionA, count=count)
next_stateA = actionA
self.updateRewardA += rewardA
self.computeLossA(rewardA)
# Let's memorize what just happened
self.AgentA.replay_memory.append(
(stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
# restart the game if player A fails to get the ball, and let B start the game
if rewardA == 0:
self.restart = True
time.sleep(0.5)
nextplayer = 'B'
self.GeneralReward = False
else:
self.restart = False
self.GeneralReward = True
# Sample memories and use the target DQN to produce the target Q-Value
X_state_val, X_action_val, rewards, X_next_state_val, continues = (
self.AgentA.sample_memories(self.AgentA.batch_size))
next_q_values = self.AgentA.model.predict(
[X_next_state_val])
max_next_q_values = np.max(
next_q_values, axis=1, keepdims=True)
y_val = rewards + continues * self.AgentA.discount_rate * max_next_q_values
# Train the online DQN
self.AgentA.model.fit(X_state_val, tf.keras.utils.to_categorical(
X_next_state_val, num_classes=10), verbose=0)
nextplayer = 'B'
self.updateIter += 1
count = 0
# evaluate A
else:
# Online DQN evaluates what to do
q_valueA = self.AgentA.model.predict([stateA])
actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
# Online DQN plays
obsA, rewardA, doneA, infoA = self.stepA(
action=actionA, count=count)
next_stateA = actionA
# Let's memorize what just happened
self.AgentA.replay_memory.append(
(stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
stateA = next_stateA
if nextplayer == 'A':
count += 1
else:
count = 0
else:
if count == 0:
# Online DQN evaluates what to do
q_valueB = self.AgentB.model.predict([stateB])
actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
# Online DQN plays
obsB, rewardB, doneB, infoB = self.stepB(
action=actionB, count=count)
next_stateB = actionB
# Let's memorize what just happened
self.AgentB.replay_memory.append(
(stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
stateB = next_stateB
elif count == 49:
# Online DQN evaluates what to do
q_valueB = self.AgentB.model.predict([stateB])
actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
# Online DQN plays
obsB, rewardB, doneB, infoB = self.stepB(
action=actionB, count=count)
next_stateB = actionB
# Let's memorize what just happened
self.AgentB.replay_memory.append(
(stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
stateB = next_stateB
self.updateRewardB += rewardB
self.computeLossB(rewardB)
# restart the game if player A fails to get the ball, and let B start the game
if rewardB == 0:
self.restart = True
time.sleep(0.5)
self.GeneralReward = False
nextplayer = 'A'
else:
self.restart = False
self.GeneralReward = True
# Sample memories and use the target DQN to produce the target Q-Value
X_state_val, X_action_val, rewards, X_next_state_val, continues = (
self.AgentB.sample_memories(self.AgentB.batch_size))
next_q_values = self.AgentB.model.predict(
[X_next_state_val])
max_next_q_values = np.max(
next_q_values, axis=1, keepdims=True)
y_val = rewards + continues * self.AgentB.discount_rate * max_next_q_values
# Train the online DQN
self.AgentB.model.fit(X_state_val, tf.keras.utils.to_categorical(
X_next_state_val, num_classes=10), verbose=0)
nextplayer = 'A'
self.updateIter += 1
# evaluate B
else:
# Online DQN evaluates what to do
q_valueB = self.AgentB.model.predict([stateB])
actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
# Online DQN plays
obsB, rewardB, doneB, infoB = self.stepB(
action=actionB, count=count)
next_stateB = actionB
# Let's memorize what just happened
self.AgentB.replay_memory.append(
(stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
stateB = next_stateB
if nextplayer == 'B':
count += 1
else:
count = 0
iteration += 1
Comparison/Evaluation
Having played this game model-free and model-based, here are some differences that we need to be aware of:
| s/n | Model-free | Model-based |
| --- | --- | --- |
| 1 | Rewards are not accounted for (play is automated, so reward = 1) | Rewards are accounted for |
| 2 | No modelling (no decision policy is required) | Modelling is required (policy network) |
| 3 | Doesn’t require initial states to predict the next state | Requires initial states to predict the next state using the policy network |
| 4 | The rate of missing the ball with respect to time is zero | The rate of missing the ball with respect to time approaches zero |
If you’re interested, the videos linked in the References below show these two techniques in action playing tennis:
1. Model-free (discrete mathematics approach)
2. Model-based (Deep Q Network approach)
Conclusion
Tennis might be simple compared to self-driving cars, but hopefully this example showed you a few things about RL that you didn’t know.
The main difference between model-free and model-based RL is the policy network, which is required for model-based RL and unnecessary in model-free.
It’s worth noting that model-based RL often takes a massive amount of time for the DNN to learn the states reliably, without getting them wrong.
But every technique has its drawbacks and advantages; choosing the right one depends on what exactly you need your program to do.
Thanks for reading, I left a few additional references for you to follow if you want to explore this topic more.
References
- AlphaGo documentary: https://www.youtube.com/watch?v=WXuK6gekU1Y
- List of reinforcement learning environments: https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f
- Create your own reinforcement learning environment: https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef
- Types of RL Environments: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment
- Model-based Deep Q Network: https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN
- Discrete mathematics approach YouTube video: https://youtu.be/iUYxZ2tYKHw
- Deep Q Network approach YouTube video: https://youtu.be/FCwGNRiq9SY
- Model-free discrete mathematics implementation: https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-
- Hands-on Machine Learning with scikit-learn and TensorFlow: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291