Deep Reinforcement Learning for Playing Doom — Part 1: Getting Started
Reinforcement learning (RL) is a hot topic right now thanks to self-driving cars and (super)human-like performance in games like Go or Dota. However, approaching the subject can be intimidating. I had been wanting to give it a shot for quite some time but didn’t really know where to start. After a couple of initial unfruitful attempts, I started getting some gratifying results, having fun and learning a lot in the process. With the right tools and resources, it’s actually a lot easier than it looks. So come along if you are interested in applying RL in an exciting environment such as a 3D shooter!
In this series of articles, we’ll be seeing how to train a reinforcement learning agent to play various Doom scenarios. We will not be focused on implementing the algorithms themselves as there are already tons of great tutorials out there; I’ve linked some of my favorite resources in the next section. Instead, we will see how to apply state-of-the-art implementations to a challenging environment. This will give us the opportunity to explore different aspects of RL, such as curriculum learning and reward shaping, but also machine learning in general: how to monitor a model, how to fix issues, etc. The ultimate goal will be to train an RL agent to play deathmatch against in-game bots. By the end of the series, you should have a result similar to the agent seen playing in the animation below.
This article is written with the purpose of providing tips and tricks that I wish I knew before starting an RL project. There is a corresponding notebook that you can use to follow along. It has more detailed explanations about design choices and code quirks. The code is pretty self-contained so you should be able to run it on your machine and understand what it’s doing in no time.
There are not many prerequisites to follow along; however, a fair understanding of Python and machine learning concepts is needed. In particular, if you have never experimented with neural networks before, you might want to get up to speed on that before tackling reinforcement learning tasks. Here are some pointers to good resources available for free:
- OpenAI Spinning Up. A well-written guide that does a great job at explaining the key equations in RL. Also, you will find there more details about the main ideas behind popular algorithms such as DDPG, TRPO, PPO etc.
- Thomas Simonini has a great series of courses on RL where he goes into the details of actually implementing the RL algorithms. He uses tons of different environments, including the one we will be using today.
- The stable-baselines documentation and examples are also a great way to learn RL. They have notebooks that can serve as a good starting point for further experimentation. Delving into the code can be very insightful, and its good structure and comments make that easy to do.
- David Silver’s course on RL, this is where you want to go for a formal explanation of RL.
- Last but not least, I found FastAI’s course to be one of the best resources out there about deep learning in general. Every concept is illustrated live with code and demonstrations which allow for a very practical approach to machine learning. I love it.
Setting up the environment
Let’s start with a quick tour of the different tools we’ll be using:
- Stable-baselines3, a PyTorch rewrite of stable-baselines. This is a library offering state-of-the-art implementations of various popular RL algorithms.
- VizDoom, an RL-friendly version of Doom that will let us access game state and perform actions programmatically.
Stable-baselines is a fork of OpenAI’s baselines repository. They have a comprehensive Medium article showcasing the numerous features and the reasons that led them to fork the original code. It’s an awesome library, the code is well documented and easy to follow. Most of the boilerplate code has been abstracted away which means it’s super easy to get up and running with a few lines of code.
To reduce the chance of incompatibilities, here is a one-liner that will install everything we need. Feel free to pick more recent versions but do so at your own risk!
pip install stable-baselines3==0.10.0 vizdoom==1.1.8 torch==1.7.1 gym==0.17.3 tensorboard==2.4.0
Note that if you are using a Windows machine, the installation process of VizDoom requires a couple more steps that are explained here. Essentially, you need to download the binaries corresponding to the version of Python you are using, unzip the contents and move the vizdoom folder to your site-packages manually.
Getting familiar with VizDoom
VizDoom has a very complete tutorial on their main page, as well as a nice examples section on their GitHub, to help get you started. Nonetheless, we will go through the main aspects you need to know to get started as fast as possible. We will explore more of what VizDoom has to offer in another article, where we will tinker with Doom scripts to influence the learning process.
VizDoom works with scenarios. Each scenario describes a different situation with its own rules and rewards. A scenario is usually composed of two files:
- A .cfg file which parametrizes the learning environment by defining, for example, which buttons are available to our agent or what the screen input format is.
- A WAD file which is a binary file that contains the maps and other resources specific to Doom.
Here’s an example of a very simple config file:
# Resources selection
doom_scenario_path = basic.wad
doom_map = map01

# Episode finishes after 300 actions
episode_timeout = 300

# Rewards
living_reward = -1

# Rendering options
screen_format = RGB24
screen_resolution = RES_320X240

# Available actions
available_buttons = { MOVE_LEFT MOVE_RIGHT ATTACK }

mode = PLAYER
doom_skill = 5
In the beginning, we state which resources we will be using for the scenario as well as how long each episode can last (at most). Then, we specify some basic reward for our RL task. Here, each step where the agent is alive will deduct one point from its total reward (we’ll see why shortly).
The next section configures the screen buffer parameters and finally we specify which buttons we would like to make available to our agent. We will use this information to build our action space. The remaining lines are not crucial to understand at this stage. I invite you to read through the documentation to discover all the different keys and values that can be set in those config files.
Before moving on to the next part, let’s make sure that your setup is working properly. The following snippet should have all the necessary code to instantiate the basic scenario and start making some random actions. In this scenario the agent can either move left, move right or attack.
If you execute this code, you should end up with something like this. Note that your version will have different-looking characters and textures, as you will be running the game with Freedoom, a free set of resources compatible with ZDoom. This can be changed by obtaining the official Doom IWAD file.
Solving our first scenario
As a refresher, remember that in a reinforcement learning problem we are trying to teach an agent to perform a given task by iteratively analyzing and improving its own behavior. To guide the learning process, the environment will reward the agent for every action taken and the agent will adapt its behavior so as to maximize the total reward it receives.
The first scenario we will be tackling is called “basic”. The goal is to shoot the enemy as fast as possible while remaining accurate. The rewards are defined as follows:
- +101 for shooting the enemy.
- -1 per time step.
- -5 for missed shots.
And the episode ends after 300 time steps. The config file for this scenario is the one we’ve seen in the example. So, there will be three buttons available: move left, move right and attack. You might notice that in the config file we only specified the “time step” reward. This is because the other rewards are defined in the WAD file. Config files only allow you to define living and dying rewards.
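To make this reward structure concrete, the total reward for an episode is just the sum of these terms. A quick sanity check in Python (the helper name is my own):

```python
def basic_episode_reward(steps: int, missed_shots: int, enemy_hit: bool) -> int:
    """Total reward for one episode of the 'basic' scenario."""
    reward = -steps              # -1 per time step
    reward -= 5 * missed_shots   # -5 per missed shot
    if enemy_hit:
        reward += 101            # +101 for shooting the enemy
    return reward

# Enemy hit on step 10 with one missed shot along the way
print(basic_episode_reward(10, 1, True))    # -> 86

# Agent never fires and the episode times out after 300 steps
print(basic_episode_reward(300, 0, False))  # -> -300
```

This also shows why the living reward is negative: the fastest, most accurate kill maximizes the total.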
We’ll be using Proximal Policy Optimization (PPO) with an actor-critic policy. If those terms are obscure to you, check out Spinning Up’s explanation of PPO and Rudy Gilman’s intuitive illustration of actor-critic methods. Training a PPO agent with stable-baselines is ridiculously easy:
from stable_baselines3 import ppo
from stable_baselines3.common import policies
agent = ppo.PPO(policies.ActorCriticCnnPolicy, env)
The only caveat is that stable-baselines expects a Gym environment. Gym is a very simple interface that standardizes the interactions between RL agents and various environments. Therefore, we will need to write an adapter to make it work with VizDoom.
The input to our model, or observation space, will always be one (or more) RGB images of shape (height, width, channels) whose pixel values are integers in the range 0 to 255.
The output of our model will be an integer representing the index of the action we wish to perform.
import numpy as np
from gym import spaces

observation_space = spaces.Box(
    low=0,
    high=255,
    shape=(h, w, c),
    dtype=np.uint8)

action_space = spaces.Discrete(4)
As mentioned above, we need to provide the agent with a Gym environment. The two functions we need to implement for the interface are:
step(action: int) -> Tuple[np.ndarray, float, bool, Dict]
The input represents the action chosen by our agent. It returns a tuple of 4 objects: the new state, the reward obtained by performing the action, a boolean flag indicating whether the episode has finished and, finally, a dict containing additional information.
reset() -> Frame
Self-explanatory: resets internal variables and returns a new initial observation.
VizDoom’s equivalent to the step function is make_action. This function takes an optional second argument, the number of game tics for which the action will be applied, and returns the total reward obtained for performing the action (Doom runs at 35 tics per second). As we’ll see later, modifying the tic value is actually one of the most efficient ways to speed up the learning process.
To know whether the episode has finished, we can use the appropriately named is_episode_finished. Finally, we need to return the updated state, which can be done by accessing the frame buffer. We’ll factor that part out since we need the same functionality in the reset function. All in all, here’s how it could look:
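The full adapter lives in the linked notebook; here is a condensed sketch of what such a Gym wrapper could look like (DoomEnv and the constructor argument names are illustrative):

```python
import gym
import numpy as np
from gym import spaces


class DoomEnv(gym.Env):
    """Gym adapter around a VizDoom DoomGame instance (illustrative sketch)."""

    def __init__(self, game, frame_processor, frame_skip=4):
        super().__init__()
        self.game = game
        self.frame_processor = frame_processor  # crops/resizes raw frames
        self.frame_skip = frame_skip

        # One one-hot action per available button
        n_buttons = game.get_available_buttons_size()
        self.action_space = spaces.Discrete(n_buttons)
        self.possible_actions = np.eye(n_buttons).tolist()

        # Infer the processed frame shape from a dummy RGB24 frame
        h, w = game.get_screen_height(), game.get_screen_width()
        dummy = np.zeros((h, w, 3), dtype=np.uint8)
        shape = frame_processor(dummy).shape
        self.observation_space = spaces.Box(
            low=0, high=255, shape=shape, dtype=np.uint8)
        self.empty_frame = np.zeros(shape, dtype=np.uint8)

    def step(self, action):
        # Apply the chosen action for frame_skip tics at once
        reward = self.game.make_action(
            self.possible_actions[action], self.frame_skip)
        done = self.game.is_episode_finished()
        return self._current_frame(), reward, done, {}

    def reset(self):
        self.game.new_episode()
        return self._current_frame()

    def _current_frame(self):
        state = self.game.get_state()
        # No frame is available once the episode has ended
        if state is None:
            return self.empty_frame
        return self.frame_processor(state.screen_buffer)
```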
The frame_preprocessor member variable is just an intermediate step that crops/resizes the frame buffer. In the notebook I explain why it is helpful to have it there. The final adapter class is available in full here.
Finally, we can pass our doom game instance to the adapter and wrap the whole thing in a VecEnv, an object that stable-baselines knows how to handle. The use of VecEnv is a bit quirky so if you’re intrigued, check out the notebook where I explain why it’s good to have it.
Starting the learning process
With this adapter it now becomes very easy to solve our first scenario! Here is the main script below. It also includes a callback that will evaluate the model every 25K steps and save it each time it reaches a new reward record.
And just like that, we have an RL agent learning to play Doom in a simple scenario. Stable-baselines also does the logging for you if you provide the model with a path through the tensorboard_log parameter. Just launch TensorBoard with the correct path and you can monitor the progress made by your RL agent in real time!
This will start a TensorBoard session, available on port 6006 by default. Visit 127.0.0.1:6006 with your web browser. Within a couple thousand steps, you should see your agent start performing really well.
I invite you to take a look at the associated notebook where we go into more details of this setup. In particular, we explore the default architecture used by stable-baselines and look at a couple of simple ways to improve the model.
Looking into the code, you will notice that every action taken by our model is repeated for four tics. This is captured by the frame_skip parameter. The reason is that repeating actions for several frames has a positive effect on learning. This “trick” is quite popular and has been used, for example, for playing Atari Breakout with DQN. In particular, the authors of VizDoom also use it in their paper.
We can also observe this effect in our simple environment. The following figure shows the difference in the average reward obtained during the process. Each curve represents the average of six runs for a given frame skip value.
Notice how picking a frame skip value that is too low slows the learning process. My intuitive explanation is that a larger frame skip allows the model to explore the environment more rapidly and thus encounter positive situations more often. However, picking a value that is too high means the agent will not have the granularity required to perform precise actions, so there is a sweet spot to be found. Feel free to test other parameters, either for the environment or for the agent.
If you have feedback or questions, please do not hesitate to share! I’m curious to see what other people are doing out there in RL. In the next parts of this series, we will tackle more challenging scenarios, inspect our model to improve it and work our way towards creating an agent that can play competitive deathmatch games. Stay tuned!