Model-Based and Model-Free Reinforcement Learning: Pytennis Case Study

Elisha Odemakinde
14th November, 2022

Reinforcement learning is a field of Artificial Intelligence in which you build an intelligent system that learns from its environment through interaction and evaluates what it learns in real time.

Good examples are self-driving cars and the game-playing programs built by DeepMind: AlphaGo, AlphaStar, and AlphaZero.

AlphaGo was the first program to beat a professional human Go player. AlphaZero is a program built to master the games of chess, shogi, and Go. AlphaStar plays the video game StarCraft II.

In this article, we’ll compare model-free vs model-based reinforcement learning. Along the way, we will explore:

  1. Fundamental concepts of Reinforcement Learning
    a) Markov decision processes / Q-Value / Q-Learning / Deep Q Network
  2. Difference between model-based and model-free reinforcement learning.
  3. Discrete mathematical approach to playing tennis – model-free reinforcement learning.
  4. Tennis game using Deep Q Network – model-based reinforcement learning.
  5. Comparison/Evaluation
  6. References to learn more

Fundamental concepts of Reinforcement Learning

Any reinforcement learning problem includes the following elements:

  1. Agent – the program controlling the object of concern (for instance, a robot).
  2. Environment – this defines the outside world programmatically. Everything the agent(s) interacts with is part of the environment. It’s built to look like a real-world case to the agent, and it’s needed to test the agent’s performance, i.e. whether it will do well once deployed in a real-world application.
  3. Rewards – a score of how the algorithm performs with respect to the environment. In this case study it’s represented as 1 or 0: ‘1’ means that the policy network made the right move, ‘0’ means a wrong move. In other words, rewards represent gains and losses.
  4. Policy – the algorithm used by the agent to decide its actions. This is the part that can be model-based or model-free.

Every problem that needs an RL solution starts with simulating an environment for the agent. Next, you build a policy network that guides the actions of the agent. The agent can then evaluate the policy by checking whether each action resulted in a gain or a loss.

The policy is the main discussion point of this article. A policy can be model-based or model-free. When building one, our concern is how to optimize the policy network, for example via policy gradients (PG).

PG algorithms directly try to optimize the policy to increase rewards. To understand these algorithms, we must take a look at Markov decision processes (MDP).

Markov decision processes / Q-Value / Q-Learning / Deep Q Network

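A Markov decision process (MDP) has a fixed number of states and moves randomly from one state to another at each step. The probability of moving from state A to state B is fixed and, unlike in a plain Markov chain, depends on the action the agent chooses; some transitions also return a reward.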
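
To make the idea of fixed transition probabilities concrete, here is a minimal sketch of a process that moves between three states. The state names and probabilities are made up for illustration, and actions and rewards are left out to keep it short.

```python
import random

# Hypothetical 3-state example; the probabilities are made up for illustration.
TRANSITIONS = {
    "s0": {"s0": 0.7, "s1": 0.2, "s2": 0.1},
    "s1": {"s0": 0.0, "s1": 0.1, "s2": 0.9},
    "s2": {"s0": 0.5, "s1": 0.5, "s2": 0.0},
}


def next_state(state, rng):
    """Sample the next state using the fixed probabilities for `state`."""
    targets = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Follow the process for `steps` transitions and return the visited states."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


path = simulate("s0", 20)
```

Each row of probabilities sums to 1, and a transition with probability 0 (for example s1 to s0) never appears in the path.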

Many Reinforcement Learning problems with discrete actions are modeled as Markov decision processes in which the agent initially knows neither the transition probabilities nor the rewards. It has to explore the states and actions to discover which ones pay off. This leads us to Q-Learning.

The Q-Learning algorithm adapts the Q-Value Iteration algorithm to the situation where the agent has no prior knowledge of the transition probabilities and rewards. The Q-Value of a state-action pair (s, a) is the total discounted future reward the agent can expect if it takes action a in state s and acts optimally afterwards; Q-Learning estimates these values from experience.
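
As a rough illustration of Q-Learning, the sketch below learns Q-Values for a tiny corridor task. The environment (five states in a row, reward 1 for reaching the last one), the hyperparameters and all names are assumptions chosen for the example, not part of Pytennis.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (hypothetical toy task)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 200:
            # Explore with random actions; Q-Learning is off-policy,
            # so it can still learn the greedy values from this behaviour.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state, steps = nxt, steps + 1
    return q


q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right in every non-goal state, which is the shortest way to the reward in this toy task.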

In its tabular form, Q-Learning doesn’t scale well to large (or even medium) MDPs with many states and actions, because it needs one estimate per state-action pair. The solution is to approximate the Q-Value of any state-action pair (s, a) with a function that has a manageable number of parameters. This is called Approximate Q-Learning.

DeepMind proposed the use of deep neural networks, which work much better, especially for complex problems – without the use of any feature engineering. A deep neural network used to estimate Q-Values is called a deep Q-network (DQN). Using DQN for approximated Q-learning is called Deep Q-Learning.

Difference between model-based and model-free Reinforcement Learning

RL algorithms can be mainly divided into two categories – model-based and model-free.

A model-based algorithm, as the name suggests, has an agent that tries to understand its environment and builds a model of it from its interactions with that environment. In such a system, preferences take priority over the consequences of the actions, i.e. the greedy agent will always try to perform the action that earns the maximum reward, irrespective of what that action may cause.

On the other hand, model-free algorithms seek to learn the consequences of their actions through experience via algorithms such as Policy Gradient, Q-Learning, etc. In other words, such an algorithm will carry out an action multiple times and will adjust the policy (the strategy behind its actions) for optimal rewards, based on the outcomes.

Think of it this way: if the agent can predict the reward for an action before actually performing it, and can therefore plan what it should do, the algorithm is model-based. If it needs to carry out the action to see what happens and learn from it, it is model-free.
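
The distinction can be sketched in a few lines. This is a deliberately tiny, hypothetical one-step task: the action names, the reward numbers and the function names are made up for illustration.

```python
# Hypothetical one-step task with two actions; the reward numbers are made up.
TRUE_REWARD = {"left": 0.2, "right": 1.0}


def act_in_environment(action):
    """Stands in for the real environment: the only way to observe a reward."""
    return TRUE_REWARD[action]


def model_based_choice(reward_model):
    """Plan: consult a model to predict each reward, without acting at all."""
    return max(reward_model, key=reward_model.get)


def model_free_choice(trials_per_action=3):
    """Learn by doing: try every action, then prefer the best average outcome."""
    averages = {}
    for action in TRUE_REWARD:
        outcomes = [act_in_environment(action) for _ in range(trials_per_action)]
        averages[action] = sum(outcomes) / len(outcomes)
    return max(averages, key=averages.get)


planned = model_based_choice({"left": 0.2, "right": 1.0})
learned = model_free_choice()
```

Both routes pick the same action here. The difference is that the model-based agent never touched the environment, while the model-free agent had to act six times to find out.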

This results in different applications for the two classes. For example, a model-based approach may be the perfect fit for playing chess or for a robotic arm on a product assembly line, where the environment is static and getting the task done as efficiently as possible is the main concern. In real-world applications such as self-driving cars, however, a model-based approach might prompt the car to run over a pedestrian to reach its destination in less time (maximum reward), whereas a model-free approach would make the car wait until the road is clear (optimal way out).

To make this concrete, we’ll work through an example: model-free and model-based RL for a tennis game. To build the model, we need an environment in which to run the policy. We won’t build the environment in this article; we’ll import one to use in our program.

Pytennis environment

We’ll use the Pytennis environment to build a model-free and model-based RL system.

A tennis game requires the following:

  1. 2 players which implies 2 agents.
  2. A tennis lawn – main environment.
  3. A single tennis ball.
  4. Movement of the agents in the left-right (or right-left) direction.

The Pytennis environment specifications are:

  1. There are 2 agents (2 players) with a ball.
  2. There’s a tennis field of dimension (x, y) – (300, 500).
  3. The ball is designed to move in a straight line: agent A picks a target point between x1 (0) and x2 (300) on side B (Agent B’s side), and the ball is then displayed 50 times at an FPS of 20, i.e. roughly 2.5 seconds per shot. This makes the ball move in a straight line from source to destination. The same applies to agent B.
  4. Movement of Agent A and Agent B is bound between (x1= 100, to x2 = 600).
  5. Movement of the ball is bound along the y-axis (y1 = 100 to y2 = 600).
  6. Movement of the ball is bound along the x-axis (x1 = 100, to x2 = 600).

Pytennis is an environment that mimics real-life tennis situations. The two images below show a model-free Pytennis game (left) and a model-based one (right).

[Images: Pytennis model-free (left), Pytennis model-based (right)]

Discrete mathematical approach to playing tennis – model-free Reinforcement Learning

Why “discrete mathematical approach to playing tennis”? Because this method is a purely logical, rule-based implementation of play in the Pytennis environment: positions are computed directly, and nothing is learned.

The code below shows the implementation of the ball movement on the lawn. The source code is linked in the references (item 8).

import time
import numpy as np
import pygame
import sys
#import seaborn as sns

from pygame.locals import *
pygame.init()


class Network:
   def __init__(self, xmin, xmax, ymin, ymax):
       """
       xmin: 150,
       xmax: 450,
       ymin: 100,
       ymax: 600
       """

       self.StaticDiscipline = {
           'xmin': xmin,
           'xmax': xmax,
           'ymin': ymin,
           'ymax': ymax
       }

   def network(self, xsource, ysource=100, Ynew=600, divisor=50):
       """
       Compute the straight-line path of the ball from (xsource, ysource)
       to a randomly chosen target (Xnew, Ynew) on the opponent's side.

       For Network A
       ysource: will always be 100
       xsource: will always be between xmin and xmax (static discipline)
       For Network B
       ysource: will always be 600
       xsource: will always be between xmin and xmax (static discipline)

       Returns two lists: the x and y coordinates of divisor + 1 points.
       """

       # Pick a random target x. Resample when it equals xsource: a vertical
       # line has no finite slope, so y = mx + c cannot describe it.
       while True:
           Xnew = int(np.random.choice(range(
               self.StaticDiscipline['xmin'], self.StaticDiscipline['xmax'])))
           if Xnew != xsource:
               break

       # Slope and intercept of the line through source and target
       slope = (ysource - Ynew) / (xsource - Xnew)
       intercept = ysource - (slope * xsource)

       # Step x from xsource towards Xnew in `divisor` equal increments.
       # The increment is negative when the target lies to the left.
       increment = (Xnew - xsource) / divisor
       XNewList = [xsource]
       newXval = xsource
       for i in range(divisor):
           newXval += increment
           XNewList.append(int(newXval))

       # determine the values of y, from the new values of x, using y = mx + c
       yNewList = [int((slope * x) + intercept) for x in XNewList]

       return XNewList, yNewList

Here is how this works once the networks are initialized (Network A for Agent A and Network B for Agent B):

# Testing
net = Network(150, 450, 100, 600)
NetworkA = net.network(300, ysource=100, Ynew=600)  # Network A
NetworkB = net.network(200, ysource=600, Ynew=100)  # Network B

Each network is bounded by the directions of ball movement. Network A represents Agent A and defines the movement of the ball from Agent A to any position between xmin and xmax (150 and 450 in the test above) along the x-axis on Agent B’s side. The same applies to Network B (Agent B).

When a network is called, the .network method generates 50 discrete steps along the x-axis, from x1 (the current position of the ball on Agent A’s side) to a randomly selected point x2 on Agent B’s side, and computes the corresponding y-points (between y1 = 100 and y2 = 600) for Network A. The same applies to Network B (Agent B).
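
The idea can be sketched without pygame or NumPy. Note that this is not the .network method itself: it swaps y = mx + c for a parametric form (a fraction t of the way from source to target), which also copes with a perfectly vertical shot. The function name and the example coordinates are assumptions.

```python
def straight_line_path(x_source, y_source, x_target, y_target, frames=50):
    """Return frames + 1 integer (x, y) points from source to target, inclusive."""
    path = []
    for i in range(frames + 1):
        t = i / frames                      # fraction of the way to the target
        x = x_source + t * (x_target - x_source)
        y = y_source + t * (y_target - y_source)
        path.append((int(x), int(y)))
    return path


# Agent A at (300, 100) sends the ball to x = 400 on Agent B's baseline (y = 600).
path = straight_line_path(300, 100, 400, 600)
```

The path starts at the source, ends at the target, and passes through (350, 350) halfway.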

To automate the movement of each agent, the opposing agent has to move in step with the ball. This is done by assigning the ball’s x position to the agent’s x position, as in the code below.

playerax = ballx #When Agent A plays.

playerbx = ballx #When Agent B plays.

Meanwhile, the agent that just hit the ball has to move back from its current position to its default position. The code below illustrates this.

def DefaultToPosition(x1, x2=300, divisor=50):
   XNewList = []
   if x1 < x2:
       differences = x2 - x1
       increment = differences / divisor
       newXval = x1
       for i in range(divisor):
           newXval += increment
           XNewList.append(int(np.floor(newXval)))

   else:
       differences = x1 - x2
       decrement = differences / divisor
       newXval = x1
       for i in range(divisor):
           newXval -= decrement
           XNewList.append(int(np.floor(newXval)))
   return XNewList
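
To see what such a return path looks like, here is a compact restatement of the same idea, assuming the default position x2 = 300 and 50 steps as above. It multiplies the step instead of accumulating it, so it should match DefaultToPosition up to floating-point rounding; the lowercase name is only there to keep the two apart.

```python
import math


def default_to_position(x1, x2=300, divisor=50):
    """Equal steps from x1 toward x2; the start point x1 itself is not included."""
    step = (x2 - x1) / divisor              # negative when x1 is right of x2
    return [int(math.floor(x1 + step * (i + 1))) for i in range(divisor)]


from_left = default_to_position(200)        # moves right in steps of 2
from_right = default_to_position(450)       # moves left in steps of 3
```

Both lists contain 50 positions and end exactly at the default position, 300.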

Now, to make the agents play against each other continuously, this has to run in a loop. After every 50 counts (50 frame displays of the ball), the opposing player becomes the next player. The code below puts all of it together in a loop.

def main():
   while True:
       display()
       if nextplayer == 'A':
           # playerA should play
           if count == 0:
               #playerax = lastxcoordinate
               NetworkA = net.network(
                   lastxcoordinate, ysource=100, Ynew=600)  # Network A
               out = DefaultToPosition(lastxcoordinate)

               # update lastxcoordinate

               bally = NetworkA[1][count]
               playerax = ballx #When Agent A plays.
               count += 1
#                 soundObj = pygame.mixer.Sound('sound/sound.wav')
#                 soundObj.play()
#                 time.sleep(0.3)
#                 soundObj.stop()
           else:
               ballx = NetworkA[0][count]
               bally = NetworkA[1][count]
               playerbx = ballx
               playerax = out[count]
               count += 1

           # let playerB play after 50 new coordinate of ball movement
           if count == 49:
               count = 0
               nextplayer = 'B'
           else:
               nextplayer = 'A'

       else:
           # playerB can play
           if count == 0:
               #playerbx = lastxcoordinate
               NetworkB = net.network(
                   lastxcoordinate, ysource=600, Ynew=100)  # Network B
               out = DefaultToPosition(lastxcoordinate)

               # update lastxcoordinate
               bally = NetworkB[1][count]
               playerbx = ballx
               count += 1

#                 soundObj = pygame.mixer.Sound('sound/sound.wav')
#                 soundObj.play()
#                 time.sleep(0.3)
#                 soundObj.stop()
           else:
               ballx = NetworkB[0][count]
               bally = NetworkB[1][count]
               playerbx = out[count]
               playerax = ballx
               count += 1
           # update lastxcoordinate

           # let playerA play after 50 new coordinate of ball movement
           if count == 49:
               count = 0
               nextplayer = 'A'
           else:
               nextplayer = 'B'

       # CHECK BALL MOVEMENT
       DISPLAYSURF.blit(PLAYERA, (playerax, 50))
       DISPLAYSURF.blit(PLAYERB, (playerbx, 600))
       DISPLAYSURF.blit(ball, (ballx, bally))

       # update last coordinate
       lastxcoordinate = ballx

       pygame.display.update()
       fpsClock.tick(FPS)

       for event in pygame.event.get():

           if event.type == QUIT:
               pygame.quit()
               sys.exit()
       return

This is the model-free approach in its most basic form. It’s model-free because the two agents need no form of learning or modelling to keep playing against each other accurately.
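
Stripped of the rendering, the turn-taking in the loop reduces to a counter that hands the ball over once a shot has been fully displayed. The sketch below assumes 50 frames per shot, as described in the text (the listing above switches when its counter reaches 49); the function name is hypothetical.

```python
def simulate_turns(total_frames, frames_per_shot=50, first_player="A"):
    """Return which player owns the shot at each displayed frame."""
    owners = []
    player, count = first_player, 0
    for _ in range(total_frames):
        owners.append(player)
        count += 1
        if count == frames_per_shot:        # shot finished: hand over the ball
            player = "B" if player == "A" else "A"
            count = 0
    return owners


owners = simulate_turns(120)
```

Over 120 frames, Agent A owns frames 0 to 49, Agent B frames 50 to 99, and Agent A again from frame 100 onwards.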

Tennis game using Deep Q Network – model-based Reinforcement Learning

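In this case study, the Deep Q Network serves as the model-based example. (In the wider RL literature, DQN is usually classed as a model-free algorithm, because it learns Q-Values rather than a model of the environment’s transitions; the label here refers to the learned policy network.) The source code for this work is linked in the references (item 5).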

The code below illustrates the Deep Q Network, which is the model architecture for this work.

import tensorflow as tf  # needed later for tf.keras.utils.to_categorical
from keras import Sequential
from keras.optimizers import Adam
from keras.layers import Dense
from collections import deque
import numpy as np



class DQN:
   def __init__(self):
       self.learning_rate = 0.001
       self.momentum = 0.95
       self.eps_min = 0.1
       self.eps_max = 1.0
       self.eps_decay_steps = 2000000
       self.replay_memory_size = 500
       self.replay_memory = deque([], maxlen=self.replay_memory_size)
       n_steps = 4000000 # total number of training steps
       self.training_start = 10000 # start training after 10,000 game iterations
       self.training_interval = 4 # run a training step every 4 game iterations
       self.save_steps = 1000 # save the model every 1,000 training steps
       self.copy_steps = 10000 # copy online DQN to target DQN every 10,000 training steps
       self.discount_rate = 0.99
       self.skip_start = 90 # Skip the start of every game (it's just waiting time).
       self.batch_size = 100
       self.iteration = 0 # game iterations
       self.done = True # env needs to be reset




       self.model = self.DQNmodel()

       return



   def DQNmodel(self):
       model = Sequential()
       model.add(Dense(64, input_shape=(1,), activation='relu'))
       model.add(Dense(64, activation='relu'))
       model.add(Dense(10, activation='softmax'))
       model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=self.learning_rate))
       return model


   def sample_memories(self, batch_size):
       indices = np.random.permutation(len(self.replay_memory))[:batch_size]
       cols = [[], [], [], [], []] # state, action, reward, next_state, continue
       for idx in indices:
           memory = self.replay_memory[idx]
           for col, value in zip(cols, memory):
               col.append(value)
       cols = [np.array(col) for col in cols]
       return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3],cols[4].reshape(-1, 1))


   def epsilon_greedy(self, q_values, step):
       self.epsilon = max(self.eps_min, self.eps_max - (self.eps_max-self.eps_min) * step/self.eps_decay_steps)
       if np.random.rand() < self.epsilon:
           return np.random.randint(10) # random action
       else:
           return np.argmax(q_values) # optimal action
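
The exploration schedule inside epsilon_greedy is easier to judge with a few numbers. The sketch below pulls out just the epsilon formula, using the values from the DQN class above; the function name epsilon_at is an assumption.

```python
EPS_MIN, EPS_MAX, EPS_DECAY_STEPS = 0.1, 1.0, 2_000_000   # values from the DQN class


def epsilon_at(step):
    """Linearly anneal epsilon from EPS_MAX to EPS_MIN, then hold it there."""
    decayed = EPS_MAX - (EPS_MAX - EPS_MIN) * step / EPS_DECAY_STEPS
    return max(EPS_MIN, decayed)


schedule = {step: epsilon_at(step) for step in (0, 1_000_000, 2_000_000, 4_000_000)}
```

Epsilon starts at 1.0 (every action random), is about 0.55 after one million steps, and stays at 0.1 from two million steps onwards. With a decay this slow, an agent that plays only a few thousand iterations still picks almost all of its actions at random.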

In this case, we need a policy network to control the movement of each agent along the x-axis. Since an agent can sit at any integer position in its range (x = 150 to x = 450 in the code below), treating every position as a separate state would give the model about 300 states to work with.

To simplify this problem, we split the range between x1 and x2 into 10 states / 10 actions, and define a lower and upper bound for each state.

Note that we have 10 actions because, from any state, the ball can be sent to any of the 10 states.

The code below illustrates the definition of both upper and lower bounds for each state.

def evaluate_state_from_last_coordinate(self, c):
       """
       cmax: 450
       cmin: 150

       c definitely will be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """
       if c >= 150 and c <= 179:
           return 0
       elif c >= 180 and c <= 209:
           return 1
       elif c >= 210 and c <= 239:
           return 2
       elif c >= 240 and c <= 269:
           return 3
       elif c >= 270 and c <= 299:
           return 4
       elif c >= 300 and c <= 329:
           return 5
       elif c >= 330 and c <= 359:
           return 6
       elif c >= 360 and c <= 389:
           return 7
       elif c >= 390 and c <= 419:
           return 8
       elif c >= 420 and c <= 450:
           return 9
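
Because every band except the last is 30 pixels wide, the same mapping can be written with integer division. This is a sketch of an equivalent formulation, not the code used in the project; the function name and parameter names are assumptions.

```python
def state_from_x(c, c_min=150, c_max=450, width=30, n_states=10):
    """Map an x-coordinate to a state index 0..9; None if c is out of range."""
    if not c_min <= c <= c_max:
        return None
    # Integer division picks the band; min() folds x = 450 into the last state.
    return min((c - c_min) // width, n_states - 1)
```

For example, state_from_x(179) is 0, state_from_x(180) is 1, and both state_from_x(420) and state_from_x(450) are 9, matching the bounds listed in the docstring above.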

The Deep Neural Network (DNN) used experimentally for this work has 1 input (which represents the previous state), 2 hidden layers of 64 neurons each, and an output layer of 10 neurons (one per state, so the output selects one of the 10 states). This is shown below:

def DQNmodel(self):
       model = Sequential()
       model.add(Dense(64, input_shape=(1,), activation='relu'))
       model.add(Dense(64, activation='relu'))
       model.add(Dense(10, activation='softmax'))
       model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=self.learning_rate))
       return model

Now that we have a DQN model that predicts the next state/action, and the Pytennis environment already handles the ball’s straight-line movement, let’s write a function that carries out an agent’s action based on the DQN model’s prediction of its next state.

The code below illustrates how Agent A decides where to direct the ball on Agent B’s side (and vice versa). It also evaluates whether Agent B was able to receive the ball.

   def randomVal(self, action):
       """
       cmax: 450
       cmin: 150

       c definitely will be between 150 and 450.
       state0 - (150 - 179)
       state1 - (180 - 209)
       state2 - (210 - 239)
       state3 - (240 - 269)
       state4 - (270 - 299)
       state5 - (300 - 329)
       state6 - (330 - 359)
       state7 - (360 - 389)
       state8 - (390 - 419)
       state9 - (420 - 450)
       """
       if action == 0:
           val = np.random.choice([i for i in range(150, 180)])
       elif action == 1:
           val = np.random.choice([i for i in range(180, 210)])
       elif action == 2:
           val = np.random.choice([i for i in range(210, 240)])
       elif action == 3:
           val = np.random.choice([i for i in range(240, 270)])
       elif action == 4:
           val = np.random.choice([i for i in range(270, 300)])
       elif action == 5:
           val = np.random.choice([i for i in range(300, 330)])
       elif action == 6:
           val = np.random.choice([i for i in range(330, 360)])
       elif action == 7:
           val = np.random.choice([i for i in range(360, 390)])
       elif action == 8:
           val = np.random.choice([i for i in range(390, 420)])
       else:
           val = np.random.choice([i for i in range(420, 451)])  # state9 includes 450
       return val

   def stepA(self, action, count=0):
       # playerA should play
       if count == 0:
           self.NetworkA = self.net.network(
               self.ballx, ysource=100, Ynew=600)  # Network A
           self.bally = self.NetworkA[1][count]
           self.ballx = self.NetworkA[0][count]

           if self.GeneralReward == True:
               self.playerax = self.randomVal(action)
           else:
               self.playerax = self.ballx


#             soundObj = pygame.mixer.Sound('sound/sound.wav')
#             soundObj.play()
#             time.sleep(0.4)
#             soundObj.stop()

       else:
           self.ballx = self.NetworkA[0][count]
           self.bally = self.NetworkA[1][count]

       obsOne = self.evaluate_state_from_last_coordinate(
           int(self.ballx))  # last state of the ball
       obsTwo = self.evaluate_state_from_last_coordinate(
           int(self.playerbx))  # evaluate player bx
       diff = np.abs(self.ballx - self.playerbx)
       obs = obsTwo
       reward = self.evaluate_action(diff)
       done = True
       info = str(diff)

       return obs, reward, done, info


   def evaluate_action(self, diff):

       if (int(diff) <= 30):
           return True
       else:
           return False

In the code above, the function stepA is executed when Agent A has to play. While playing, Agent A uses the next action predicted by the DQN to estimate the target (the x2 position on Agent B’s side, starting from the current position of the ball, x1, on its own side), and uses the ball trajectory network provided by the Pytennis environment to make its move.

Agent A, for example, gets a precise point x2 on Agent B’s side by using the function randomVal, as shown above, to randomly select a coordinate x2 within the bounds of the action given by the DQN.

Finally, the function stepA evaluates Agent B’s response to the target point x2 by using the function evaluate_action, which decides whether Agent B should be penalized or rewarded. Everything described here for Agent A playing to Agent B also applies to Agent B playing to Agent A (the same code with different variable names).
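
The two helper ideas, picking a point inside the band chosen by the DQN and rewarding a return that lands close enough, can be sketched without NumPy. The function names sample_target and received are assumptions; the band limits and the 30-pixel tolerance come from the code above.

```python
import random


def sample_target(action, rng, c_min=150, width=30, c_max=450):
    """Pick a random integer x inside the band that belongs to `action` (0..9)."""
    low = c_min + width * action
    high = c_max if action == 9 else low + width - 1   # last band includes 450
    return rng.randint(low, high)                      # randint is inclusive


def received(ball_x, player_x, tolerance=30):
    """Reward rule from evaluate_action: within 30 pixels counts as a return."""
    return abs(ball_x - player_x) <= tolerance


rng = random.Random(0)
targets = [sample_target(a, rng) for a in range(10)]
```

Each sampled target falls inside the band of its action, so action 0 always lands between 150 and 179 and action 9 between 420 and 450. A player standing at 330 receives a ball at 300, while a player at 331 misses it.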

Now that we have the policy, reward, environment, states and actions correctly defined, we can go ahead and make the two agents play the game against each other in a loop.

The code below shows how the agents take turns after every 50 ball displays. Note that for each ball display, the DQN makes a decision on where to send the ball for the next agent to play.

while iteration < iterations:

           self.display()
           self.randNumLabelA = self.myFontA.render(
               'A (Win): '+str(self.updateRewardA) + ', A(loss): '+str(self.lossA), 1, self.BLACK)
           self.randNumLabelB = self.myFontB.render(
               'B (Win): '+str(self.updateRewardB) + ', B(loss): ' + str(self.lossB), 1, self.BLACK)
           self.randNumLabelIter = self.myFontIter.render(
               'Iterations: '+str(self.updateIter), 1, self.BLACK)

           if nextplayer == 'A':

               if count == 0:
                   # Online DQN evaluates what to do
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)

                   # Online DQN plays
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   # Let's memorize what just happened
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
                   stateA = next_stateA

               elif count == 49:

                   # Online DQN evaluates what to do
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   self.updateRewardA += rewardA
                   self.computeLossA(rewardA)

                   # Let's memorize what just happened
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))

                   # restart the game if player A fails to get the ball, and let B start the game
                   if rewardA == 0:
                       self.restart = True
                       time.sleep(0.5)
                       nextplayer = 'B'
                       self.GeneralReward = False
                   else:
                       self.restart = False
                       self.GeneralReward = True

                   # Sample memories and use the target DQN to produce the target Q-Value
                   X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                       self.AgentA.sample_memories(self.AgentA.batch_size))
                   next_q_values = self.AgentA.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=1, keepdims=True)
                   y_val = rewards + continues * self.AgentA.discount_rate * max_next_q_values

                   # Train the online DQN
                   self.AgentA.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=10), verbose=0)

                   nextplayer = 'B'
                   self.updateIter += 1

                   count = 0
                   # evaluate A

               else:
                   # Online DQN evaluates what to do
                   q_valueA = self.AgentA.model.predict([stateA])
                   actionA = self.AgentA.epsilon_greedy(q_valueA, iteration)

                   # Online DQN plays
                   obsA, rewardA, doneA, infoA = self.stepA(
                       action=actionA, count=count)
                   next_stateA = actionA

                   # Let's memorize what just happened
                   self.AgentA.replay_memory.append(
                       (stateA, actionA, rewardA, next_stateA, 1.0 - doneA))
                   stateA = next_stateA

               if nextplayer == 'A':
                   count += 1
               else:
                   count = 0

           else:
               if count == 0:
                   # Online DQN evaluates what to do
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   # Online DQN plays
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB

                   # Let's memorize what just happened
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
                   stateB = next_stateB

               elif count == 49:

                   # Online DQN evaluates what to do
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   # Online DQN plays
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB

                   # Let's memorize what just happened
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))

                   stateB = next_stateB
                   self.updateRewardB += rewardB
                   self.computeLossB(rewardB)

                   # restart the game if player B fails to get the ball, and let A start the game
                   if rewardB == 0:
                       self.restart = True
                       time.sleep(0.5)
                       self.GeneralReward = False
                       nextplayer = 'A'
                   else:
                       self.restart = False
                       self.GeneralReward = True

                   # Sample memories and use the target DQN to produce the target Q-Value
                   X_state_val, X_action_val, rewards, X_next_state_val, continues = (
                       self.AgentB.sample_memories(self.AgentB.batch_size))
                   next_q_values = self.AgentB.model.predict(
                       [X_next_state_val])
                   max_next_q_values = np.max(
                       next_q_values, axis=1, keepdims=True)
                   y_val = rewards + continues * self.AgentB.discount_rate * max_next_q_values

                   # Train the online DQN
                   self.AgentB.model.fit(X_state_val, tf.keras.utils.to_categorical(
                       X_next_state_val, num_classes=10), verbose=0)

                   nextplayer = 'A'
                   self.updateIter += 1
                   # evaluate B

               else:
                   # Online DQN evaluates what to do
                   q_valueB = self.AgentB.model.predict([stateB])
                   actionB = self.AgentB.epsilon_greedy(q_valueB, iteration)

                   # Online DQN plays
                   obsB, rewardB, doneB, infoB = self.stepB(
                       action=actionB, count=count)
                   next_stateB = actionB

                   # Let's memorize what just happened
                   self.AgentB.replay_memory.append(
                       (stateB, actionB, rewardB, next_stateB, 1.0 - doneB))
                   stateB = next_stateB

               if nextplayer == 'B':
                   count += 1
               else:
                   count = 0

           iteration += 1
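
The replay memory and the target computation in the loop above can be looked at in isolation. The sketch below stores two hand-written transitions in the (state, action, reward, next_state, continue) layout used by sample_memories and computes the target r + continue * discount_rate * max Q(next_state, a'). The Q-Values are made-up numbers standing in for model.predict, and td_targets is a hypothetical helper name.

```python
import random
from collections import deque


def td_targets(batch, q_next, discount_rate=0.99):
    """Compute r + continue * gamma * max_a' Q(s', a') for each transition."""
    targets = []
    for (_, _, reward, next_state, cont) in batch:
        best_next = max(q_next[next_state])
        targets.append(reward + cont * discount_rate * best_next)
    return targets


# Replay memory holding (state, action, reward, next_state, continue) tuples.
replay_memory = deque(maxlen=500)
replay_memory.append((0, 3, 1.0, 3, 1.0))   # shot returned, rally continues
replay_memory.append((3, 7, 0.0, 7, 0.0))   # shot missed, nothing to bootstrap

# Hypothetical Q-Values per state, standing in for model.predict(...)
q_next = {3: [0.1, 0.5, 0.2], 7: [0.9, 0.4, 0.3]}

batch = random.Random(0).sample(list(replay_memory), k=2)
targets = td_targets(batch, q_next)
```

The returned shot gets a target of 1.0 + 0.99 * 0.5 = 1.495, while the missed shot gets 0.0 because its continue flag switches off the bootstrapped term.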

Comparison/Evaluation

Having played this game model-free and model-based, here are some differences that we need to be aware of:

| s/n | Model-free | Model-based |
|---|---|---|
| 1 | rewards are not accounted for (since this is automated, reward = 1) | rewards are accounted for |
| 2 | no modelling (no decision policy is required) | modelling is required (policy network) |
| 3 | this doesn’t require the use of initial states to predict the next state | this requires the use of initial states to predict the next state using the policy network |
| 4 | the rate of missing the ball with respect to time is zero | the rate of missing the ball with respect to time approaches zero |

If you’re interested, the videos below show these two techniques in action playing tennis games:

1. Model-free: https://youtu.be/iUYxZ2tYKHw

2. Model-based: https://youtu.be/FCwGNRiq9SY

Conclusion

Tennis might be simple compared to self-driving cars, but hopefully this example showed you a few things about RL that you didn’t know. 

In this case study, the main difference between model-free and model-based RL is the policy network, which is required for model-based RL and unnecessary in model-free.

It’s worth noting that model-based RL often needs a massive amount of training time before the DNN learns the states well enough to stop making mistakes.

But every technique has its drawbacks and advantages; choosing the right one depends on what exactly you need your program to do.

Thanks for reading. I’ve left a few additional references for you to follow if you want to explore this topic further.

References

  1. AlphaGo documentary: https://www.youtube.com/watch?v=WXuK6gekU1Y
  2. List of reinforcement learning environments: https://medium.com/@mauriciofadelargerich/reinforcement-learning-environments-cff767bc241f
  3. Create your own reinforcement learning environment: https://towardsdatascience.com/create-your-own-reinforcement-learning-environment-beb12f4151ef
  4. Types of RL Environments: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781838649777/1/ch01lvl1sec14/types-of-rl-environment
  5. Model-based Deep Q Network: https://github.com/elishatofunmi/pytennis-Deep-Q-Network-DQN
  6. Discrete mathematics approach YouTube video: https://youtu.be/iUYxZ2tYKHw
  7. Deep Q Network approach YouTube video: https://youtu.be/FCwGNRiq9SY
  8. Model-free discrete mathematics implementation: https://github.com/elishatofunmi/pytennis-Discrete-Mathematics-Approach-
  9. Hands-on Machine Learning with scikit-learn and TensorFlow: https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291