Continuous Control With Deep Reinforcement Learning

5 min
8th August, 2023

This time I want to explore how deep reinforcement learning can be used to, for example, make a humanoid model walk. This kind of task is a continuous control task. A solution to it differs from the one you might know and use to play Atari games, like Pong, with e.g. a Deep Q-Network (DQN).

I’ll talk about what characterizes continuous control environments. Then, I’ll introduce the actor-critic architecture and show an example of a state-of-the-art actor-critic method, Soft Actor-Critic (SAC). Finally, we will dive into the code: I’ll briefly explain how it is implemented in the amazing SpinningUp framework. Let’s go!

What is continuous control?

Meet Humanoid. It is a three-dimensional bipedal robot environment. Its observations are 376-dimensional vectors that describe the kinematic properties of the robot. Its actions are 17-dimensional vectors that specify torques to be applied to the robot's joints. The goal is to run forward as fast as possible… and not fall over.

The actions are continuous-valued vectors. This is very different from the fixed set of possible actions you might know from Atari environments. It requires the policy to return not the scores, or qualities, of all possible actions, but simply one action to be executed. A different policy output requires a different training strategy, which we will explore in the next section.
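
The difference can be sketched in a few lines. This is an illustrative toy (the linear "network" and its weights are stand-ins, not any real model), but it shows the contract each kind of policy has to satisfy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete case (e.g. Atari): the network scores every action,
# and acting means taking the argmax over those scores.
def act_discrete(q_values):
    return int(np.argmax(q_values))

# Continuous case (e.g. Humanoid): the policy outputs the action vector
# itself. The linear "network" below is a stand-in; tanh squashes each
# torque into [-1, 1].
def act_continuous(obs, weights):
    return np.tanh(weights @ obs)

q = np.array([0.1, 0.9, 0.3])               # scores for 3 discrete actions
obs = rng.standard_normal(376)              # Humanoid-sized observation
W = 0.01 * rng.standard_normal((17, 376))   # 17 torque outputs

print(act_discrete(q))                # -> 1
print(act_continuous(obs, W).shape)   # -> (17,)
```

In the discrete case the "best action" falls out of the scores for free; in the continuous case the policy has to produce the action directly, with no scores to compare.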

What is MuJoCo?

MuJoCo is a fast and accurate physics simulation engine aimed at research and development in robotics, biomechanics, graphics, and animation. OpenAI Gym, and the Humanoid environment it includes, uses it to simulate the environment dynamics. I wrote a whole post about installing and using it here. We won't need it for this post, though.

Off-policy actor-critic methods

Let’s recap: Reinforcement learning (RL) is learning what to do — how to map situations to actions — to maximize some notion of cumulative reward. RL consists of an agent that, in order to learn, acts in an environment. The environment provides a response to each agent’s action that is fed back to the agent. A reward is used as a reinforcing signal and a state is used to condition the agent’s decisions.

The goal, really, is to find an optimal policy. The policy tells the agent how it should behave in whatever state it finds itself in. It is the agent's map for reaching the environment's objective.
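
The interaction loop described above can be written down in a few lines. This is a toy stand-in (the environment and hand-written "policy" below are illustrative, not a real Gym API), but it is the exact loop every RL agent runs:

```python
# A minimal agent-environment loop with a toy environment: the state
# drifts toward 1.0, and the reward penalizes distance from that goal.
class ToyEnv:
    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        reward = -abs(self.state - 1.0)       # reward peaks when state == 1
        done = abs(self.state - 1.0) < 1e-3   # episode ends at the goal
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
total_reward = 0.0
for _ in range(10):
    action = 0.5 * (1.0 - state)   # a hand-written "policy": move halfway to the goal
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

The state conditions the agent's decision, and the reward is the reinforcing signal the agent tries to maximize over the whole episode.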

The actor-critic architecture, depicted in the diagram above, divides the agent into two pieces, the Actor and the Critic:

• The Actor represents the policy – it learns this mapping from states to actions.
• The Critic represents the Q-function – it learns to evaluate how good each action is in every possible state. You can see that the actor uses the critic evaluations for improving the policy.

Why use such a construct? If you already know Q-Learning (here you can learn about it), you know that training the Q-function can be useful for solving an RL task. The Q-function will tell you how good each action is in any state. You can then simply pick the best action. It's easy when you have a fixed set of actions: you simply evaluate each and every one of them and take the best!

However, what do you do when the action is continuous? You can't evaluate every value. You could evaluate some values and pick the best, but that creates its own problems, e.g. resolution: how many values, and which ones, do you evaluate? The actor is the answer to these problems. It approximates the argmax operator from the discrete case: it is trained to predict the best action we would get if we could evaluate every possible action with the critic. Below we describe an example, Soft Actor-Critic (SAC).
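
A minimal sketch of that idea, for a one-dimensional action and a made-up critic (the quadratic below stands in for a learned Q-function; in SAC the actor is a neural network trained by gradient ascent on the critic, which the loop here mimics):

```python
import numpy as np

# Critic for one fixed state: a toy quadratic with its peak at a = 0.37.
def q_value(action):
    return -(action - 0.37) ** 2

# Discretization workaround: evaluate a grid of candidate actions and
# pick the best -- the grid resolution now limits how close we can get.
grid = np.linspace(-1.0, 1.0, 11)            # only 11 candidates
best_on_grid = grid[np.argmax(q_value(grid))]  # 0.4, off by the grid step

# The actor instead learns to output argmax_a Q(s, a) directly; here we
# mimic its training with a few gradient-ascent steps on Q.
action = 0.0
for _ in range(100):
    grad = -2.0 * (action - 0.37)   # dQ/da at the current action
    action += 0.1 * grad            # converges to ~0.37
```

The grid answer is stuck at the nearest grid point, while the gradient-trained "actor" homes in on the true maximizer without ever enumerating actions.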

SAC in pseudo-code

SAC's critic is trained off-policy, meaning it can reuse data collected by older, less trained policies. The off-policy critic training in lines 11-13 uses a technique very similar to that of DQN, e.g. a target Q-network to stabilize training. Being off-policy makes it more sample-efficient than on-policy methods like PPO, because we can construct an experience replay buffer where each collected data sample can be reused for training multiple times, contrary to on-policy training, where data is discarded after only one update!
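
An experience replay buffer is little more than a ring of fixed-size arrays. The sketch below is illustrative (field names are mine, not the exact ones in sac.py), but it captures the two operations that make off-policy training possible: store overwrites the oldest data when full, and sample draws a random batch that can be reused many times:

```python
import numpy as np

# A minimal experience replay buffer in the spirit of the one in sac.py
# (field and method names here are illustrative).
class ReplayBuffer:
    def __init__(self, obs_dim, act_dim, size):
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.act = np.zeros((size, act_dim), dtype=np.float32)
        self.rew = np.zeros(size, dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.done = np.zeros(size, dtype=np.float32)
        self.ptr, self.filled, self.size = 0, 0, size

    def store(self, o, a, r, o2, d):
        self.obs[self.ptr], self.act[self.ptr] = o, a
        self.rew[self.ptr] = r
        self.next_obs[self.ptr], self.done[self.ptr] = o2, d
        self.ptr = (self.ptr + 1) % self.size          # overwrite oldest data
        self.filled = min(self.filled + 1, self.size)

    def sample(self, batch_size, rng):
        idx = rng.integers(0, self.filled, size=batch_size)  # with replacement
        return (self.obs[idx], self.act[idx], self.rew[idx],
                self.next_obs[idx], self.done[idx])

rng = np.random.default_rng(0)
buf = ReplayBuffer(obs_dim=3, act_dim=1, size=4)
for i in range(6):   # 6 stores into a size-4 buffer -> the first 2 are overwritten
    buf.store(np.full(3, i), [i], float(i), np.full(3, i + 1), False)
o, a, r, o2, d = buf.sample(batch_size=32, rng=rng)
```

Note that a batch of 32 can be drawn from only 4 stored transitions: each sample is reused, which is exactly the sample-efficiency argument above.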

You can see the critics and the replay buffer being initialized in line 1, alongside the target critics in line 2. We use two critics to fight the overestimation error described in the papers "Double Q-learning" and "Addressing Function Approximation Error in Actor-Critic Methods"; you can learn more about it here. Then, the data is collected and fed to the replay buffer in lines 4-8. The policy is updated in line 14 and the target networks are updated in line 15.
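
The two pieces just mentioned, the twin-critic target and the target-network update, fit in a few lines each. This is a sketch that assumes the target-critic outputs and next-action log-probabilities are already computed; names and default values are illustrative:

```python
import numpy as np

# Sketch of the twin-critic ("clipped double-Q") Bellman target for a batch.
def sac_target(rew, done, q1_targ, q2_targ, logp_next, gamma=0.99, alpha=0.2):
    # Taking the minimum of the two target critics fights overestimation.
    min_q = np.minimum(q1_targ, q2_targ)
    # The -alpha * logp term is SAC's entropy bonus on the next action.
    return rew + gamma * (1.0 - done) * (min_q - alpha * logp_next)

# Polyak averaging for the target networks (line 15 of the pseudo-code):
# the targets trail the trained critics slowly, which stabilizes training.
def polyak_update(targ_params, params, rho=0.995):
    return [rho * t + (1.0 - rho) * p for t, p in zip(targ_params, params)]
```

With rho close to 1 the target networks change only slightly per step, so the regression target for the critics stays nearly stationary, the same stabilization trick DQN achieves with periodic hard copies.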

You may have noticed that both the critic and the actor updates include some additional log terms. This is the max-entropy regularization that keeps the agent from exploiting its, possibly imperfect, knowledge too much and rewards exploring promising actions. If you want to understand it in detail I recommend you read this resource.
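
To see what that log term does on the actor side, here is a sketch of the entropy-regularized actor objective (the function and batch values below are illustrative):

```python
import numpy as np

# Sketch of SAC's actor objective: minimize alpha * log pi(a|s) - Q(s, a)
# averaged over a batch. The alpha * logp part is the "additional log term".
def actor_loss(q_of_pi_action, logp_pi, alpha=0.2):
    return np.mean(alpha * logp_pi - q_of_pi_action)

# Two candidate actions with the same Q-value: the entropy term gives a
# lower loss to the one the policy currently finds less likely, which is
# what nudges the agent toward keeping promising alternatives alive.
loss_likely = actor_loss(np.array([1.0]), np.array([-0.1]))  # probable action
loss_rare = actor_loss(np.array([1.0]), np.array([-2.0]))    # improbable action
```

With equal Q-values, the improbable action wins, so the policy is pushed away from collapsing onto a single action too early; the temperature alpha controls how strong that push is.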

Soft actor-critic in code

We will work with the Spinning Up in Deep RL – TF2 implementation framework. The installation instructions are in the repo README. Note that you don't have to install MuJoCo for now. We will run an example Soft Actor-Critic agent on the Pendulum-v0 environment from the OpenAI Gym suite. Let's jump into it!

The Pendulum-v0 environment

Pendulum-v0 is a continuous control environment where:

• Actions: the torque of a single joint, in one dimension.
• Observations: three-dimensional vectors, where the first two dimensions represent the pendulum position – they are the cos and sin of the pendulum angle – and the third dimension is the pendulum's angular velocity.
• Goal: swing the pendulum to the straight-up position and keep it vertical, with the least angular velocity and the least effort (torque introduced by the actions).
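
The observation encoding is worth a second look: the angle is reported as its cos and sin rather than as a raw number, which avoids the discontinuity when the angle wraps around. A tiny sketch of how the 3-D observation relates to the underlying state (theta measured from upright):

```python
import numpy as np

# How Pendulum-v0's 3-D observation relates to the underlying state:
# angle theta (0 = upright) and angular velocity theta_dot.
def pendulum_obs(theta, theta_dot):
    return np.array([np.cos(theta), np.sin(theta), theta_dot])

upright = pendulum_obs(0.0, 0.0)      # the goal state -> [1, 0, 0]
hanging = pendulum_obs(np.pi, 0.0)    # pendulum hanging straight down
```

The goal state is the observation [1, 0, 0]: cos(0) = 1, sin(0) = 0, and zero angular velocity.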

You can think of it as a simplified model of more complex robots like Humanoid, which is built from many similar, but two- or three-dimensional, joints.

Training the SAC agent

In the repo, the SAC agent code is here. The core.py file includes the actor-critic models' factory methods and other utilities. The sac.py file includes the replay buffer definition and the implementation of the training algorithm presented above. I recommend you look through it and try to map the pseudo-code lines from above to the actual implementation in that file. Then, check against my list:

• the initialization from lines 1-2 of the pseudo-code is implemented in lines 159-179 of sac.py,
• the main loop from line 3 of the pseudo-code is implemented in line 264 of sac.py,
• the data collection from lines 4-8 of the pseudo-code is implemented in lines 270-295 of sac.py,
• the update handling from lines 9-11 of the pseudo-code is implemented in lines 298-300 of sac.py,
• the parameters update from lines 12-15 of the pseudo-code is called in line 301 and implemented in lines 192-240 of sac.py,
• and the rest of the code in the sac.py is mostly logging handling and some more boilerplate code.

The example training in the Pendulum-v0 environment is implemented in run_example.py in the repo root. Simply run it like this: python run_example.py. After 200 000 environment steps, the training will automatically finish and save the trained model in the ./out/checkpoint directory.

Below are example logs from the beginning and the end of the training. Note how AverageTestEpRet improved, from a huge negative number to something closer to zero, which is the maximum return. Returns are negative because the agent is penalized whenever the pendulum is not in the goal position: vertical, with zero angular velocity and zero torque.

The training took 482 seconds (around 8 minutes) on my MacBook with the Intel i5 processor.

Before training

```
---------------------------------------
|      AverageEpRet |       -1.48e+03 |
|          StdEpRet |             334 |
|          MaxEpRet |            -973 |
|          MinEpRet |       -1.89e+03 |
|  AverageTestEpRet |        -1.8e+03 |
|      StdTestEpRet |             175 |
|      MaxTestEpRet |       -1.48e+03 |
|      MinTestEpRet |       -1.94e+03 |
|             EpLen |             200 |
|         TestEpLen |             200 |
| TotalEnvInteracts |           2e+03 |
|     AverageQ1Vals |       -4.46e+03 |
|         StdQ1Vals |         7.1e+04 |
|         MaxQ1Vals |           0.744 |
|         MinQ1Vals |           -63.3 |
|     AverageQ2Vals |       -4.46e+03 |
|         StdQ2Vals |        7.11e+04 |
|         MaxQ2Vals |            0.74 |
|         MinQ2Vals |           -63.5 |
|      AverageLogPi |           -35.2 |
|          StdLogPi |             562 |
|          MaxLogPi |            3.03 |
|          MinLogPi |           -8.33 |
|            LossPi |            17.4 |
|            LossQ1 |            2.71 |
|            LossQ2 |            2.13 |
|    StepsPerSecond |        4.98e+03 |
|              Time |             3.8 |
---------------------------------------
```

After training

```
---------------------------------------
|      AverageEpRet |            -176 |
|          StdEpRet |            73.8 |
|          MaxEpRet |           -9.95 |
|          MinEpRet |            -250 |
|  AverageTestEpRet |            -203 |
|      StdTestEpRet |            55.3 |
|      MaxTestEpRet |            -129 |
|      MinTestEpRet |            -260 |
|             EpLen |             200 |
|         TestEpLen |             200 |
| TotalEnvInteracts |           2e+05 |
|     AverageQ1Vals |       -1.56e+04 |
|         StdQ1Vals |        2.48e+05 |
|         MaxQ1Vals |           -41.8 |
|         MinQ1Vals |            -367 |
|     AverageQ2Vals |       -1.56e+04 |
|         StdQ2Vals |        2.48e+05 |
|         MaxQ2Vals |           -42.9 |
|         MinQ2Vals |            -380 |
|      AverageLogPi |             475 |
|          StdLogPi |        7.57e+03 |
|          MaxLogPi |            7.26 |
|          MinLogPi |           -10.6 |
|            LossPi |            61.6 |
|            LossQ1 |            2.01 |
|            LossQ2 |            1.27 |
|    StepsPerSecond |        2.11e+03 |
|              Time |             482 |
---------------------------------------
```

Visualizing the trained policy

Now, with the trained model saved, we can run it and see how it does! Run this script:

`python run_policy.py --model_path ./out/checkpoint --env_name Pendulum-v0`

in the repo root. You'll see your agent playing 10 episodes one after another! Isn't it cool? Did your agent learn to perfectly align the pendulum vertically? Mine did not. You may try playing with the hyper-parameters in the run_example.py file (the agent function's parameters) to make the agent find a better policy. A small hint: I observed that finishing the training earlier might help. All the hyper-parameters are defined in the SAC docstring in the sac.py file.

You may wonder why each episode is different. It is because the initial conditions (the pendulum's starting angle and velocity) are randomized each time the environment is reset and a new episode starts.
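
A quick sketch of what such a randomized reset looks like (the ranges below mirror Pendulum-v0's reset, where the starting angle is drawn from [-pi, pi] and the angular velocity from [-1, 1]; the function itself is illustrative):

```python
import numpy as np

# Why every episode looks different: each reset draws a fresh starting
# angle and angular velocity, so the agent never sees the exact same
# episode twice.
rng = np.random.default_rng()

def reset():
    theta = rng.uniform(-np.pi, np.pi)   # starting angle
    theta_dot = rng.uniform(-1.0, 1.0)   # starting angular velocity
    return theta, theta_dot
```

This randomization is a feature, not a bug: it forces the policy to work from any starting configuration instead of memorizing one trajectory.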

Conclusions

The next step for you is to train SAC in some more complex environment like Humanoid or any other environment from the MuJoCo suite. Installing MuJoCo to Work With OpenAI Gym Environments is the guide I wrote on how to install MuJoCo and get access to these complex environments. It also describes useful diagnostics to track. You can read more about logging these diagnostics in Logging in Reinforcement Learning Frameworks – What You Need to Know. There are also other frameworks that implement algorithms that can solve the continuous control tasks. Read about them in this post: Best Benchmarks for Reinforcement Learning: The Ultimate List. Thank you for your time and see you next time!
