Model-Based and Model-Free Reinforcement Learning: Pytennis Case Study
Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real time.
Good examples of this are self-driving cars and the programs DeepMind built that we know today as AlphaGo, AlphaStar, and AlphaZero.
AlphaZero is a program built to master the games of chess, shogi and Go (AlphaGo was the first program to beat a professional human Go player). AlphaStar plays the video game StarCraft II.
In this article, we'll compare model-free vs model-based reinforcement learning. Along the way, we will explore:
 Fundamental concepts of Reinforcement Learning
  a) Markov decision processes / Q-Value / Q-Learning / Deep Q Network
 Difference between model-based and model-free reinforcement learning
 Discrete mathematical approach to playing tennis – model-free reinforcement learning
 Tennis game using Deep Q Network – model-based reinforcement learning
 Comparison/Evaluation
 References to learn more
Fundamental concepts of Reinforcement Learning
Any reinforcement learning problem includes the following elements:
 Agent – the program controlling the object of concern (for instance, a robot).
 Environment – this defines the outside world programmatically. Everything the agent(s) interacts with is part of the environment. It's built to give the agent a realistic scenario, and it's needed to evaluate an agent's performance, that is, whether it will do well once deployed in a real-world application.
 Rewards – this gives us a score of how the algorithm performs with respect to the environment. It's represented as 1 or 0: '1' means that the policy network made the right move, '0' means a wrong move. In other words, rewards represent gains and losses.
 Policy – the algorithm used by the agent to decide its actions. This is the part that can be model-based or model-free.
Every problem that needs an RL solution starts with simulating an environment for the agent. Next, you build a policy network that guides the actions of the agent. The agent can then evaluate the policy by checking whether its corresponding action resulted in a gain or a loss.
The policy is our main discussion point for this article. A policy can be model-based or model-free. When building one, our concern is how to optimize the policy network via policy gradient (PG).
PG algorithms directly try to optimize the policy to increase rewards. To understand these algorithms, we must take a look at Markov decision processes (MDP).
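Markov decision processes / Q-Value / Q-Learning / Deep Q Network
An MDP builds on a Markov chain: a process with a fixed number of states that randomly evolves from one state to another at each step, where the probability of moving from state A to state B is fixed. In an MDP, the agent additionally chooses an action at each step, the transition probabilities depend on that action, and some transitions return a reward.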
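To make the idea of fixed transition probabilities concrete, here is a minimal sketch that simulates such a process. The three states and the matrix values are made up for illustration and are not part of Pytennis.

```python
import numpy as np

# Hypothetical 3-state process; row i holds the fixed probabilities of
# moving from state i to each state (every row sums to 1).
TRANSITIONS = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
])


def simulate(start_state, n_steps, seed=0):
    """Return the list of states visited, starting from start_state."""
    rng = np.random.default_rng(seed)
    state = start_state
    visited = [state]
    for _ in range(n_steps):
        # The next state depends only on the current state (Markov property).
        state = int(rng.choice(len(TRANSITIONS), p=TRANSITIONS[state]))
        visited.append(state)
    return visited


path = simulate(start_state=0, n_steps=10)
```

Each call walks through the states one step at a time, always sampling the next state from the row of the current one.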
A lot of Reinforcement Learning problems with discrete actions are modeled as Markov decision processes, with the agent initially having no clue about the transition probabilities. The agent also doesn't know what the rewards are, so it has to explore all possible states to begin to work out how to act to maximize its reward. This leads us to what we call Q-Learning.
The Q-Learning algorithm is adapted from the Q-Value Iteration algorithm for situations where the agent has no prior knowledge of the transition probabilities and rewards. The Q-Value of a state-action pair (s, a) is an estimate of the expected sum of discounted future rewards the agent gets by taking action a in state s and acting optimally afterwards.
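As a rough illustration of the Q-Learning update, the sketch below learns a small Q-table on a toy environment. The environment (toy_step), the number of states and actions, and the values of ALPHA and GAMMA are all assumptions chosen for the example; none of this comes from the Pytennis code.

```python
import numpy as np

N_STATES, N_ACTIONS = 3, 2
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor (assumed values)


def toy_step(state, action, rng):
    """Hypothetical environment: action 1 taken in state 2 pays 1, else 0."""
    reward = 1.0 if (state == 2 and action == 1) else 0.0
    next_state = int(rng.integers(N_STATES))  # uniformly random transition
    return next_state, reward


def q_learning(n_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))
    state = 0
    for _ in range(n_steps):
        action = int(rng.integers(N_ACTIONS))  # pure random exploration
        next_state, reward = toy_step(state, action, rng)
        # Q-Learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        target = reward + GAMMA * q[next_state].max()
        q[state, action] += ALPHA * (target - q[state, action])
        state = next_state
    return q


q_table = q_learning()
```

The agent never sees the reward rule directly; it only observes rewards after acting, yet the table ends up preferring action 1 in state 2.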
It is often said that Q-Learning doesn't scale well to large (or even medium) MDPs with many states and actions. The solution is to approximate the Q-Value of any state-action pair (s, a). This is called Approximate Q-Learning.
DeepMind proposed the use of deep neural networks, which work much better, especially for complex problems, without the use of any feature engineering. A deep neural network used to estimate Q-Values is called a deep Q-network (DQN). Using a DQN for Approximate Q-Learning is called Deep Q-Learning.
Difference between model-based and model-free Reinforcement Learning
RL algorithms can be mainly divided into two categories – model-based and model-free.
Model-based, as it sounds, has an agent trying to understand its environment and creating a model for it based on its interactions with this environment. In such a system, preferences take priority over the consequences of the actions, i.e. the greedy agent will always try to perform an action that will get the maximum reward irrespective of what that action may cause.
On the other hand, model-free algorithms seek to learn the consequences of their actions through experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words, such an algorithm will carry out an action multiple times and will adjust the policy (the strategy behind its actions) for optimal rewards, based on the outcomes.
Think of it this way: if the agent can predict the reward for some action before actually performing it, and so plan what it should do, the algorithm is model-based. If it actually needs to carry out the action to see what happens and learn from it, it is model-free.
This results in different applications for these two classes. For example, a model-based approach may be the perfect fit for playing chess or for a robotic arm in the assembly line of a product, where the environment is static and getting the task done most efficiently is our main concern. However, in the case of real-world applications such as self-driving cars, a model-based approach might prompt the car to run over a pedestrian to reach its destination in less time (maximum reward), but a model-free approach would make the car wait till the road is clear (optimal way out).
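The difference between predicting a reward before acting and learning it by acting can be sketched with a toy one-step problem. Everything here (the two actions, TRUE_REWARD, and the function names) is hypothetical and only meant to illustrate the distinction described above.

```python
import random

# Hypothetical one-step environment: the true expected reward of each action.
TRUE_REWARD = {"left": 0.2, "right": 0.8}


def pull(action, rng):
    """Environment step: returns 1.0 with probability TRUE_REWARD[action]."""
    return 1.0 if rng.random() < TRUE_REWARD[action] else 0.0


def model_based_choice(reward_model):
    """Plan: consult a model to predict rewards without acting."""
    return max(reward_model, key=reward_model.get)


def model_free_choice(n_trials=2000, seed=0):
    """Learn by acting: average the rewards actually observed per action."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in TRUE_REWARD}
    counts = {a: 0 for a in TRUE_REWARD}
    for _ in range(n_trials):
        action = rng.choice(list(TRUE_REWARD))
        totals[action] += pull(action, rng)
        counts[action] += 1
    estimates = {a: totals[a] / max(counts[a], 1) for a in TRUE_REWARD}
    return max(estimates, key=estimates.get)


planned = model_based_choice(TRUE_REWARD)
learned = model_free_choice()
```

Both agents end up choosing the same action here, but the first one needed a model of the rewards up front, while the second one needed many trials.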
To better understand this, we will explain everything with an example. In the example, we'll build model-free and model-based RL for tennis games. To build the model, we need an environment for the policy to get implemented. However, we won't build the environment in this article; we'll import one to use for our program.
Pytennis environment
We'll use the Pytennis environment to build a model-free and a model-based RL system.
A tennis game requires the following:
 2 players which implies 2 agents.
 A tennis lawn – main environment.
 A single tennis ball.
 Movement of the agents from left to right (or from right to left).
The Pytennis environment specifications are:
 There are 2 agents (2 players) with a ball.
 There's a tennis field of dimension (x, y) = (300, 500).
 The ball is designed to move in a straight line: agent A picks a target point between x1 (0) and x2 (300) on side B (Agent B's side), and the ball is then displayed 50 times at an FPS of 20 along the way. This makes the ball move in a straight line from source to destination. The same applies to agent B.
 Movement of Agent A and Agent B is bound between x1 = 100 and x2 = 600.
 Movement of the ball is bound along the y-axis (y1 = 100 to y2 = 600).
 Movement of the ball is bound along the x-axis (x1 = 100 to x2 = 600).
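The straight-line ball movement in the list above can be sketched as follows. This is not the Pytennis implementation: it swaps in np.linspace to interpolate both coordinates, and the function name and the sample source and target points are assumptions for illustration.

```python
import numpy as np


def straight_line_path(source, target, n_points=50):
    """Return n_points integer (x, y) positions from source to target."""
    xs = np.linspace(source[0], target[0], n_points)
    ys = np.linspace(source[1], target[1], n_points)
    return [(int(round(x)), int(round(y))) for x, y in zip(xs, ys)]


# Ball leaves Agent A's side at (300, 100) and lands at (200, 600).
path = straight_line_path(source=(300, 100), target=(200, 600))
```

Drawing one of these positions per frame gives the 50 ball displays per shot. Interpolating x and y separately also handles a perfectly vertical shot, where the slope is undefined.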
Pytennis is an environment that mimics real-life tennis situations. As shown below, the image on the left is a model-free Pytennis game, and the one on the right is model-based.
Discrete mathematical approach to playing tennis – model-free Reinforcement Learning
Why "discrete mathematical approach to playing tennis"? Because this method is a logical implementation of the Pytennis environment.
The code below shows us the implementation of the ball movement on the lawn. You can find the source code here.
import time
import numpy as np
import pygame
import sys
#import seaborn as sns
from pygame.locals import *
pygame.init()


class Network:
    def __init__(self, xmin, xmax, ymin, ymax):
        """
        xmin: 150,
        xmax: 450,
        ymin: 100,
        ymax: 600
        """
        self.StaticDiscipline = {
            'xmin': xmin,
            'xmax': xmax,
            'ymin': ymin,
            'ymax': ymax
        }

    def network(self, xsource, ysource=100, Ynew=600, divisor=50):  # ysource will always be 100
        """
        For Network A
        ysource: will always be 100
        xsource: will always be between xmin and xmax (static discipline)
        For Network B
        ysource: will always be 600
        xsource: will always be between xmin and xmax (static discipline)
        """
        while True:
            ListOfXsourceYSource = []
            Xnew = np.random.choice([i for i in range(
                self.StaticDiscipline['xmin'], self.StaticDiscipline['xmax'])], 1)
            #Ynew = np.random.choice([i for i in range(self.StaticDiscipline['ymin'], self.StaticDiscipline['ymax'])], 1)
            source = (xsource, ysource)
            target = (Xnew[0], Ynew)
            # A vertical path has an undefined slope, so draw a new target
            if xsource == Xnew[0]:
                continue
            # Slope and intercept
            slope = (ysource - Ynew)/(xsource - Xnew[0])
            intercept = ysource - (slope*xsource)
            if (slope != np.inf) and (intercept != np.inf):
                break
            else:
                continue
        #print(source, target)
        # step through 50 evenly spaced x values between xsource and Xnew (monotonically decreasing/increasing)
        XNewList = [xsource]
        if xsource < Xnew[0]:
            differences = Xnew[0] - xsource
            increment = differences / divisor
            newXval = xsource
            for i in range(divisor):
                newXval += increment
                XNewList.append(int(newXval))
        else:
            differences = xsource - Xnew[0]
            decrement = differences / divisor
            newXval = xsource
            for i in range(divisor):
                newXval -= decrement
                XNewList.append(int(newXval))
        # determine the values of y, from the new values of x, using y = mx + c
        yNewList = []
        for i in XNewList:
            findy = (slope * i) + intercept  # y = mx + c
            yNewList.append(int(findy))
        ListOfXsourceYSource = [(x, y) for x, y in zip(XNewList, yNewList)]
        return XNewList, yNewList
Here is how this works once the networks are initialized (Network A for Agent A and Network B for Agent B):
# Testing
net = Network(150, 450, 100, 600)
NetworkA = net.network(300, ysource=100, Ynew=600) # Network A
NetworkB = net.network(200, ysource=600, Ynew=100) # Network B
Each network is bounded by the directions of ball movement. Network A represents Agent A, which defines the movement of the ball from Agent A to any position between xmin and xmax (150 and 450 in the test above) along the x-axis on Agent B's side. This also applies to Network B (Agent B).
When the network is started, the .network method generates 50 discrete y-points (between y1 = 100 and y2 = 600) and the corresponding x-points (between x1, the location of the ball on Agent A's side, and a randomly selected point x2 on Agent B's side) for Network A. This also applies to Network B (Agent B).
To automate the movement of each agent, the opposing agent has to move in a corresponding direction with respect to the ball. This can only be done by setting the x position of the ball to be the x position of the opposing agent, as in the code below.
playerax = ballx #When Agent A plays.
playerbx = ballx #When Agent B plays.
Meanwhile the source agent has to move back to its default position from its current position. The code below illustrates this.
def DefaultToPosition(x1, x2=300, divisor=50):
    # Walk the agent from x1 back to its default position x2 in `divisor` steps
    XNewList = []
    if x1 < x2:
        differences = x2 - x1
        increment = differences / divisor
        newXval = x1
        for i in range(divisor):
            newXval += increment
            XNewList.append(int(np.floor(newXval)))
    else:
        differences = x1 - x2
        decrement = differences / divisor
        newXval = x1
        for i in range(divisor):
            newXval -= decrement
            XNewList.append(int(np.floor(newXval)))
    return XNewList
Now, to make the agents keep playing against each other, this has to run in a loop. After every 50 counts (50 frames of ball display), the opposing player is made the next player. The code below puts all of it together in a loop.
def main():
    # This excerpt assumes nextplayer, count, lastxcoordinate, ballx and the
    # player positions are initialised before the loop (see the linked source).
    while True:
        display()
        if nextplayer == 'A':
            # playerA should play
            if count == 0:
                #playerax = lastxcoordinate
                NetworkA = net.network(
                    lastxcoordinate, ysource=100, Ynew=600)  # Network A
                out = DefaultToPosition(lastxcoordinate)
                # update lastxcoordinate
                bally = NetworkA[1][count]
                playerax = ballx  # When Agent A plays.
                count += 1
                # soundObj = pygame.mixer.Sound('sound/sound.wav')
                # soundObj.play()
                # time.sleep(0.3)
                # soundObj.stop()
            else:
                ballx = NetworkA[0][count]
                bally = NetworkA[1][count]
                playerbx = ballx
                playerax = out[count]
                count += 1
            # let playerB play after 50 new coordinate of ball movement
            if count == 49:
                count = 0
                nextplayer = 'B'
            else:
                nextplayer = 'A'
        else:
            # playerB can play
            if count == 0:
                #playerbx = lastxcoordinate
                NetworkB = net.network(
                    lastxcoordinate, ysource=600, Ynew=100)  # Network B
                out = DefaultToPosition(lastxcoordinate)
                # update lastxcoordinate
                bally = NetworkB[1][count]
                playerbx = ballx
                count += 1
                # soundObj = pygame.mixer.Sound('sound/sound.wav')
                # soundObj.play()
                # time.sleep(0.3)
                # soundObj.stop()
            else:
                ballx = NetworkB[0][count]
                bally = NetworkB[1][count]
                playerbx = out[count]
                playerax = ballx
                count += 1
            # update lastxcoordinate
            # let playerA play after 50 new coordinate of ball movement
            if count == 49:
                count = 0
                nextplayer = 'A'
            else:
                nextplayer = 'B'
        # CHECK BALL MOVEMENT
        DISPLAYSURF.blit(PLAYERA, (playerax, 50))
        DISPLAYSURF.blit(PLAYERB, (playerbx, 600))
        DISPLAYSURF.blit(ball, (ballx, bally))
        # update last coordinate
        lastxcoordinate = ballx
        pygame.display.update()
        fpsClock.tick(FPS)
        for event in pygame.event.get():
            if event.type == QUIT:
                pygame.quit()
                sys.exit()
    return
And this is basic model-free reinforcement learning. It's model-free because you need no form of learning or modelling for the 2 agents to play simultaneously and accurately.
Tennis game using Deep Q Network – model-based Reinforcement Learning
A typical example of model-based reinforcement learning is the Deep Q Network. Source code to this work is available here.
The code below illustrates the Deep Q Network, which is the model architecture for this work.
from keras import Sequential, layers
from keras.optimizers import Adam
from keras.layers import Dense
from collections import deque
import numpy as np


class DQN:
    def __init__(self):
        self.learning_rate = 0.001
        self.momentum = 0.95
        self.eps_min = 0.1
        self.eps_max = 1.0
        self.eps_decay_steps = 2000000
        self.replay_memory_size = 500
        self.replay_memory = deque([], maxlen=self.replay_memory_size)
        n_steps = 4000000  # total number of training steps
        self.training_start = 10000  # start training after 10,000 game iterations
        self.training_interval = 4  # run a training step every 4 game iterations
        self.save_steps = 1000  # save the model every 1,000 training steps
        self.copy_steps = 10000  # copy online DQN to target DQN every 10,000 training steps
        self.discount_rate = 0.99
        self.skip_start = 90  # Skip the start of every game (it's just waiting time).
        self.batch_size = 100
        self.iteration = 0  # game iterations
        self.done = True  # env needs to be reset
        self.model = self.DQNmodel()
        return

    def DQNmodel(self):
        model = Sequential()
        model.add(Dense(64, input_shape=(1,), activation='relu'))
        model.add(Dense(64, activation='relu'))
        model.add(Dense(10, activation='softmax'))
        # recent Keras versions name this argument learning_rate (formerly lr)
        model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def sample_memories(self, batch_size):
        indices = np.random.permutation(len(self.replay_memory))[:batch_size]
        cols = [[], [], [], [], []]  # state, action, reward, next_state, continue
        for idx in indices:
            memory = self.replay_memory[idx]
            for col, value in zip(cols, memory):
                col.append(value)
        cols = [np.array(col) for col in cols]
        # rewards and continue flags are reshaped into column vectors
        return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3], cols[4].reshape(-1, 1))

    def epsilon_greedy(self, q_values, step):
        # epsilon decays linearly from eps_max to eps_min over eps_decay_steps
        self.epsilon = max(self.eps_min, self.eps_max - (self.eps_max - self.eps_min) * step/self.eps_decay_steps)
        if np.random.rand() < self.epsilon:
            return np.random.randint(10)  # random action
        else:
            return np.argmax(q_values)  # optimal action
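To see how the exploration rate in epsilon_greedy changes over training, here is a small standalone sketch of the same linear decay. The function name epsilon_at is hypothetical; the default values mirror the DQN class attributes above.

```python
def epsilon_at(step, eps_min=0.1, eps_max=1.0, eps_decay_steps=2_000_000):
    """Linearly decay epsilon from eps_max to eps_min over eps_decay_steps."""
    decayed = eps_max - (eps_max - eps_min) * step / eps_decay_steps
    return max(eps_min, decayed)


# Epsilon at the start, halfway, at the end of the decay, and beyond it.
schedule = [epsilon_at(s) for s in (0, 1_000_000, 2_000_000, 3_000_000)]
```

Early on, the agent picks random actions almost all the time; once the decay period is over, it explores on roughly one step in ten.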
In this case, we need a policy network to control the movement of each agent as they move along the x-axis. Since the x values span a wide range (150 to 450 in the code below), it's impractical for the model to treat every single coordinate as its own state.
To simplify this problem, we can split the range between x1 and x2 into 10 states / 10 actions, and define an upper and lower bound for each state.
Note that we have 10 actions, because from any state there are 10 possible next states.
The code below illustrates the definition of both upper and lower bounds for each state.
def evaluate_state_from_last_coordinate(self, c):
    """
    cmax: 450
    cmin: 150
    c definitely will be between 150 and 450.
    state0 - (150 - 179)
    state1 - (180 - 209)
    state2 - (210 - 239)
    state3 - (240 - 269)
    state4 - (270 - 299)
    state5 - (300 - 329)
    state6 - (330 - 359)
    state7 - (360 - 389)
    state8 - (390 - 419)
    state9 - (420 - 450)
    """
    if c >= 150 and c <= 179:
        return 0
    elif c >= 180 and c <= 209:
        return 1
    elif c >= 210 and c <= 239:
        return 2
    elif c >= 240 and c <= 269:
        return 3
    elif c >= 270 and c <= 299:
        return 4
    elif c >= 300 and c <= 329:
        return 5
    elif c >= 330 and c <= 359:
        return 6
    elif c >= 360 and c <= 389:
        return 7
    elif c >= 390 and c <= 419:
        return 8
    elif c >= 420 and c <= 450:
        return 9
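The same mapping from coordinate to state can also be written with integer division. This is only a compact sketch of the bounds listed above, not the code used in the project; the function name is hypothetical, and unlike the original it raises an error for out-of-range coordinates instead of returning None.

```python
def state_from_coordinate(c, cmin=150, cmax=450, n_states=10):
    """Map an x coordinate in [cmin, cmax] to a state index 0..n_states-1."""
    if not cmin <= c <= cmax:
        raise ValueError(f"coordinate {c} is outside [{cmin}, {cmax}]")
    width = (cmax - cmin) // n_states  # 30 coordinates per state
    # cmax itself (450) would land in an 11th bucket, so clamp it to state 9
    return min((int(c) - cmin) // width, n_states - 1)
```

For example, state_from_coordinate(180) falls in state 1 and state_from_coordinate(450) in state 9, matching the table in the docstring above.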
The Deep Neural Network (DNN) used experimentally for this work is a network of 1 input (which represents the previous state), 2 hidden layers of 64 neurons each, and an output layer of 10 neurons (binary selection from 10 different states). This is shown below:
def DQNmodel(self):
    model = Sequential()
    model.add(Dense(64, input_shape=(1,), activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    # recent Keras versions name this argument learning_rate (formerly lr)
    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=self.learning_rate))
    return model
Now that we have a DQN model that predicts the next state/action of the model, and the Pytennis environment has already sorted out the ball movement in a straight line, let's go ahead and write a function that carries out an action by an agent, based on the DQN model's prediction of its next state.
The detailed code below illustrates how agent A makes a decision on where to direct the ball (on Agent B's side, and vice versa). This code also evaluates whether agent B was able to receive the ball.
def randomVal(self, action):
    """
    cmax: 450
    cmin: 150
    c definitely will be between 150 and 450.
    state0 - (150 - 179)
    state1 - (180 - 209)
    state2 - (210 - 239)
    state3 - (240 - 269)
    state4 - (270 - 299)
    state5 - (300 - 329)
    state6 - (330 - 359)
    state7 - (360 - 389)
    state8 - (390 - 419)
    state9 - (420 - 450)
    """
    if action == 0:
        val = np.random.choice([i for i in range(150, 180)])
    elif action == 1:
        val = np.random.choice([i for i in range(180, 210)])
    elif action == 2:
        val = np.random.choice([i for i in range(210, 240)])
    elif action == 3:
        val = np.random.choice([i for i in range(240, 270)])
    elif action == 4:
        val = np.random.choice([i for i in range(270, 300)])
    elif action == 5:
        val = np.random.choice([i for i in range(300, 330)])
    elif action == 6:
        val = np.random.choice([i for i in range(330, 360)])
    elif action == 7:
        val = np.random.choice([i for i in range(360, 390)])
    elif action == 8:
        val = np.random.choice([i for i in range(390, 420)])
    else:
        # upper bound 451 so that 450 is included, as in the docstring
        val = np.random.choice([i for i in range(420, 451)])
    return val

def stepA(self, action, count=0):
    # playerA should play
    if count == 0:
        self.NetworkA = self.net.network(
            self.ballx, ysource=100, Ynew=600)  # Network A
        self.bally = self.NetworkA[1][count]
        self.ballx = self.NetworkA[0][count]
        if self.GeneralReward == True:
            self.playerax = self.randomVal(action)
        else:
            self.playerax = self.ballx
        # soundObj = pygame.mixer.Sound('sound/sound.wav')
        # soundObj.play()
        # time.sleep(0.4)
        # soundObj.stop()
    else:
        self.ballx = self.NetworkA[0][count]
        self.bally = self.NetworkA[1][count]
    obsOne = self.evaluate_state_from_last_coordinate(
        int(self.ballx))  # last state of the ball
    obsTwo = self.evaluate_state_from_last_coordinate(
        int(self.playerbx))  # evaluate player bx
    # horizontal distance between the ball and Agent B
    diff = np.abs(self.ballx - self.playerbx)
    obs = obsTwo
    reward = self.evaluate_action(diff)
    done = True
    info = str(diff)
    return obs, reward, done, info

def evaluate_action(self, diff):
    # Agent B receives the ball if it is within 30 units of it
    if (int(diff) <= 30):
        return True
    else:
        return False
From the code above, function stepA gets executed when Agent A has to play. While playing, Agent A uses the next action predicted by the DQN to estimate the target (position x2 on Agent B's side, given the current position of the ball, x1, on its own side), and uses the ball trajectory network developed by the Pytennis environment to make its move.
Agent A, for example, is able to get a precise point x2 on Agent B's side by using the function randomVal, as shown above, to randomly select a coordinate x2 bounded by the action given by the DQN.
Finally, function stepA evaluates the response of Agent B to target point x2 by using the function evaluate_action. The function evaluate_action defines whether Agent B should be penalized or rewarded. Just as this is described for Agent A to Agent B, it applies for Agent B to Agent A (same code with different variable names).
Now that we have the policy, reward, environment, states and actions correctly defined, we can go ahead and recursively make the two agents play the game with each other.
The code below shows how turns are taken by each agent after 50 ball displays. Note that for each ball display, the DQN is making a decision on where to toss the ball for the next agent to play.
while iteration < iterations:
    self.display()
    self.randNumLabelA = self.myFontA.render(
        'A (Win): '+str(self.updateRewardA) + ', A(loss): '+str(self.lossA), 1, self.BLACK)
    self.randNumLabelB = self.myFontB.render(
        'B (Win): '+str(self.updateRewardB) + ', B(loss): ' + str(self.lossB), 1, self.BLACK)
    self.randNumLabelIter = self.myFontIter.render(
        'Iterations: '+str(self.updateIter), 1, self.BLACK)
    if nextplayer == 'A':
        if count == 0:
            # Online DQN evaluates what to do
            q_valueA = self.AgentA.model.predict([stateA])
            actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
            # Online DQN plays
            obsA, rewardA, doneA, infoA = self.stepA(
                action=actionA, count=count)
            next_stateA = actionA
            # Let's memorize what just happened
            self.AgentA.replay_memory.append(
                (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
            stateA = next_stateA
        elif count == 49:
            # Online DQN evaluates what to do
            q_valueA = self.AgentA.model.predict([stateA])
            actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
            obsA, rewardA, doneA, infoA = self.stepA(
                action=actionA, count=count)
            next_stateA = actionA
            self.updateRewardA += rewardA
            self.computeLossA(rewardA)
            # Let's memorize what just happened
            self.AgentA.replay_memory.append(
                (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
            # restart the game if player A fails to get the ball, and let B start the game
            if rewardA == 0:
                self.restart = True
                time.sleep(0.5)
                nextplayer = 'B'
                self.GeneralReward = False
            else:
                self.restart = False
                self.GeneralReward = True
            # Sample memories and use the target DQN to produce the target Q-Value
            X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                self.AgentA.sample_memories(self.AgentA.batch_size))
            next_q_values = self.AgentA.model.predict(
                [X_next_state_val])
            max_next_q_values = np.max(
                next_q_values, axis=1, keepdims=True)
            y_val = rewards + continues * self.AgentA.discount_rate * max_next_q_values
            # Train the online DQN
            self.AgentA.model.fit(X_state_val, tf.keras.utils.to_categorical(
                X_next_state_val, num_classes=10), verbose=0)
            nextplayer = 'B'
            self.updateIter += 1
            count = 0
            # evaluate A
        else:
            # Online DQN evaluates what to do
            q_valueA = self.AgentA.model.predict([stateA])
            actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
            # Online DQN plays
            obsA, rewardA, doneA, infoA = self.stepA(
                action=actionA, count=count)
            next_stateA = actionA
            # Let's memorize what just happened
            self.AgentA.replay_memory.append(
                (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
            stateA = next_stateA
        if nextplayer == 'A':
            count += 1
        else:
            count = 0
    else:
        if count == 0:
            # Online DQN evaluates what to do
            q_valueB = self.AgentB.model.predict([stateB])
            actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
            # Online DQN plays
            obsB, rewardB, doneB, infoB = self.stepB(
                action=actionB, count=count)
            next_stateB = actionB
            # Let's memorize what just happened
            self.AgentB.replay_memory.append(
                (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
            stateB = next_stateB
        elif count == 49:
            # Online DQN evaluates what to do
            q_valueB = self.AgentB.model.predict([stateB])
            actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
            # Online DQN plays
            obsB, rewardB, doneB, infoB = self.stepB(
                action=actionB, count=count)
            next_stateB = actionB
            # Let's memorize what just happened
            self.AgentB.replay_memory.append(
                (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
            stateB = next_stateB
            self.updateRewardB += rewardB
            self.computeLossB(rewardB)
            # restart the game if player B fails to get the ball, and let A start the game
            if rewardB == 0:
                self.restart = True
                time.sleep(0.5)
                self.GeneralReward = False
                nextplayer = 'A'
            else:
                self.restart = False
                self.GeneralReward = True
            # Sample memories and use the target DQN to produce the target Q-Value
            X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                self.AgentB.sample_memories(self.AgentB.batch_size))
            next_q_values = self.AgentB.model.predict(
                [X_next_state_val])
            max_next_q_values = np.max(
                next_q_values, axis=1, keepdims=True)
            y_val = rewards + continues * self.AgentB.discount_rate * max_next_q_values
            # Train the online DQN
            self.AgentB.model.fit(X_state_val, tf.keras.utils.to_categorical(
                X_next_state_val, num_classes=10), verbose=0)
            nextplayer = 'A'
            self.updateIter += 1
            # evaluate B
        else:
            # Online DQN evaluates what to do
            q_valueB = self.AgentB.model.predict([stateB])
            actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)
            # Online DQN plays
            obsB, rewardB, doneB, infoB = self.stepB(
                action=actionB, count=count)
            next_stateB = actionB
            # Let's memorize what just happened
            self.AgentB.replay_memory.append(
                (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
            stateB = next_stateB
        if nextplayer == 'B':
            count += 1
        else:
            count = 0
    iteration += 1
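The target Q-Value expression computed in the loop above (y_val) can be checked in isolation with NumPy. This is a minimal sketch on a made-up batch of two transitions; the function name q_targets and the sample numbers are assumptions, while the formula and the discount rate of 0.99 come from the code above.

```python
import numpy as np


def q_targets(rewards, continues, next_q_values, discount_rate=0.99):
    """Compute r + gamma * max_a' Q(s', a') for a batch of transitions.

    Rows where continues is 0.0 (finished episodes) keep only the reward.
    """
    max_next_q = np.max(next_q_values, axis=1, keepdims=True)
    return rewards + continues * discount_rate * max_next_q


rewards = np.array([[1.0], [0.0]])
continues = np.array([[1.0], [0.0]])  # 0.0 marks a finished episode
next_q = np.array([[0.2, 0.5], [0.9, 0.1]])
targets = q_targets(rewards, continues, next_q)
```

The first transition gets 1.0 + 0.99 * 0.5 = 1.495 as its target, while the second is terminal and keeps its reward of 0.0. Rewards and continue flags are column vectors here, which is why sample_memories reshapes them.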
Comparison/Evaluation
Having played this game model-free and model-based, here are some differences that we need to be aware of:
s/n | Model-free | Model-based
1 | rewards are not accounted for (since this is automated, reward = 1) | rewards are accounted for
2 | no modelling (no decision policy is required) | modelling is required (policy network)
3 | this doesn't require the use of initial states to predict the next state | this requires the use of initial states to predict the next state using the policy network
4 | the rate of missing the ball with respect to time is zero | the rate of missing the ball with respect to time approaches zero
If you're interested, the videos below show these two techniques in action playing tennis games:
1. Model-free
2. Model-based
Conclusion
Tennis might be simple compared to self-driving cars, but hopefully this example showed you a few things about RL that you didn't know.
The main difference between model-free and model-based RL is the policy network, which is required for model-based RL and unnecessary in model-free.
It's worth noting that model-based RL often takes a massive amount of time for the DNN to learn the states well enough to stop making mistakes.
But every technique has its drawbacks and advantages; choosing the right one depends on what exactly you need your program to do.
Thanks for reading. I left a few additional references for you to follow if you want to explore this topic more.
References
 AlphaGo documentary: https://www.youtube.com/watch?v=WXuK6gekU1Y
 List of reinforcement learning environments: https://medium.com/@mauriciofadelargerich/reinforcementlearningenvironmentscff767bc241f
 Create your own reinforcement learning environment: https://towardsdatascience.com/createyourownreinforcementlearningenvironmentbeb12f4151ef
 Types of RL Environments: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/typesofrlenvironment
 Model-based Deep Q Network: https://github.com/elishatofunmi/pytennisDeepQNetworkDQN
 Discrete mathematics approach YouTube video: https://youtu.be/iUYxZ2tYKHw
 Deep Q Network approach YouTube video: https://youtu.be/FCwGNRiq9SY
 Model-free discrete mathematics implementation: https://github.com/elishatofunmi/pytennisDiscreteMathematicsApproach
 Hands-On Machine Learning with Scikit-Learn and TensorFlow: https://www.amazon.com/HandsMachineLearningScikitLearnTensorFlow/dp/1491962291