Deep Reinforcement Learning for Playing Doom — Part 2: Increasing complexity

This is the second part of a series dedicated to practicing reinforcement learning (RL) by teaching an agent to play various Doom scenarios. In the first part, which you can find here, we presented the setup and solved our first RL task by training an agent to move and shoot in a simple environment. A Jupyter notebook covering the concepts discussed here in detail is also available.

We will build upon the setup introduced in the first part and gradually tackle more difficult scenarios, ending with a full-fledged deathmatch environment for our learning task.

Defend The Center

One interesting scenario in the VizDoom repo that is slightly more difficult than the basic scenario seen so far is called “defend the center”. In this scenario, the agent is stuck in the middle of a circular room with enemies appearing regularly at random locations around it. The agent receives one point for every enemy killed. The scenario inevitably reaches a conclusion because the agent only has a limited amount of ammunition: 26 bullets. The maximum theoretical reward is therefore 25 (one kill per bullet, minus one point for dying). The actions that can be performed are ATTACK, TURN_LEFT and TURN_RIGHT.

Remember that in the previous part we created an environment wrapper for a Doom game instance. Using the wrapper, we can write a small helper function to ease the creation of the vectorized environments and the training process.

Using this snippet to solve the new scenario allows us to quickly reach an average reward of over 20 without breaking a sweat. Let’s increase the complexity, shall we?

Trained model playing on “defend the center”.

Defend The Center (revisited)

In a deathmatch environment, the difficulty will come from the fact that the agent will need to navigate through the map to find enemies, collect ammunition and pick up medkits to recover health. To work our way towards a scenario where the agent is able to move freely through a map, we will modify “defend the center” by adding new actions in the scenario config file. The new list of buttons we will be using is:


Simply adding more buttons would certainly work. However, you might notice that the agent becomes quite slow, in the sense that it can only perform one action at a time: shoot, turn or move.

Although this is not a big issue in the current scenario, when playing against bots running all over the place our agent will be unfairly disadvantaged because of the limitations in its actions. This is aggravated by the fact that in Doom, players can actually combine forward and lateral movements to run at a velocity higher than what is possible in either direction alone. This is called straferunning. In the animation below, we can clearly see the times where the agent must stop its progression to make a turn.

Trained agent navigating using one button at a time (6 outputs).
Trained agent navigating using movement combinations (18 outputs).

Our model currently uses a discrete action space, which cannot output multiple actions at the same time. A simple workaround is to generate all possible combinations of buttons and represent each combination as a single action. Unfortunately, the number of possible combinations increases exponentially with the number of available buttons. Even in our simple case with only 6 buttons, this would mean having 2⁶ = 64 possible outputs. To mitigate this combinatorial explosion of our action space, we are going to forbid certain combinations. Indeed, it does not make much sense to allow, for example, both TURN_RIGHT and TURN_LEFT to be activated at the same time. Therefore, we will remove the following combinations:


We will also only allow the ATTACK button to be used alone. This will not hurt performance as most weapons in Doom have a “cooldown” period between two attacks anyway. This will give us a grand total of 18 actions which is much better than the 64 we could have ended up with. Don’t hesitate to take a look at the notebook to see the implementation details. As you can see from the comparison above, an agent trained using the modified 18-output version navigates across the map much more dynamically. We can hopefully expect better results when competing in a deathmatch.
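As a sketch, the filtering described above might be implemented as follows. The exact button list and exclusion pairs are assumptions on my part (the notebook has the definitive version), but with these choices we land on exactly 18 actions:

```python
from itertools import combinations

# Hypothetical button layout: ATTACK plus five movement/turn buttons.
MOVE_BUTTONS = ["TURN_LEFT", "TURN_RIGHT", "MOVE_FORWARD", "MOVE_LEFT", "MOVE_RIGHT"]
# Pairs that make no sense when pressed together:
MUTUALLY_EXCLUSIVE = [("TURN_LEFT", "TURN_RIGHT"), ("MOVE_LEFT", "MOVE_RIGHT")]


def build_actions():
    """Enumerate all allowed button combinations as individual actions."""
    actions = [["ATTACK"]]  # ATTACK is only ever used alone
    for size in range(1, len(MOVE_BUTTONS) + 1):
        for combo in combinations(MOVE_BUTTONS, size):
            if any(a in combo and b in combo for a, b in MUTUALLY_EXCLUSIVE):
                continue  # skip contradictory combinations
            actions.append(list(combo))
    return actions


print(len(build_actions()))  # → 18
```

Before being sent to VizDoom's `make_action`, each combination would still need to be converted into the boolean vector matching the button order declared in the config file.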

Using our new action combinations, we can now start solving the variant of “defend the center” where the agent must not only eliminate enemies, but do so while having the freedom to move around! Training the model in this more challenging environment will definitely require additional training time: typically, solving the more difficult variant requires around five times as many steps as the easy one. Nonetheless, this gets us closer to our objective, since the last missing piece is now adding bots!

Monitoring the model

Since training is becoming increasingly difficult, this is a good opportunity to log some metrics to help us monitor the whole process. Stable-baselines already logs several useful metrics for us such as the losses and the episode rewards. The next most useful quantity to keep an eye on is the evolution of the model weights and activations. In particular, we’d like to make sure that our activations are well behaved. That is, they are not always zero (which would suggest we have dead neurons) and that the variance of each layer output stays away from zero as well. If this is not the case, we might end up with vanishing gradient issues and ultimately poor learning performance.

Let’s start by tracking neuron activations. This is particularly important for the first layers of our model. To do this, we will use FastAI’s approach of defining hook wrappers, illustrated in lesson 10 of their excellent online deep learning course. A hook is simply a function that can be attached to a specific layer of our model and that will be called on each forward pass. This is a feature offered by PyTorch, and the idea is to build on top of it to keep track of useful statistics.

We will combine this approach with Tensorboard’s ability to log tensors to get the distribution over time of our layer activations. Our hooks will contain a single attribute, activation_data, that will store the layer output.
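A minimal version of such a hook wrapper might look like this in PyTorch (the class name is mine; only the activation_data attribute is prescribed by the text above):

```python
import torch
import torch.nn as nn


class ActivationHook:
    """Attach to a layer and store its latest output in activation_data."""

    def __init__(self, module: nn.Module):
        self.activation_data = None
        # register_forward_hook calls _hook_fn after every forward pass.
        self.handle = module.register_forward_hook(self._hook_fn)

    def _hook_fn(self, module, inputs, output):
        # Detach so we don't keep the autograd graph alive between steps.
        self.activation_data = output.detach().cpu()

    def remove(self):
        """Detach the hook when monitoring is no longer needed."""
        self.handle.remove()
```

One hook instance would be created per layer of interest, typically the early convolutional layers of the policy network.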

The next step is to define a callback that will periodically log hook data. For simplicity, we will only be logging to Tensorboard the latest activation value of each training phase.

Of course, if you need more granularity, you can log more values. Keep in mind though that stable-baselines’ logger implementation aligns everything on the rollout time steps (which do not correspond to the number of forward passes in training). Therefore, you will need to either log to a file yourself or adapt stable-baselines’ logger implementation. You can also take a look at the plotting helpers, which use the complete list of activations to create the diagnostic plots below.

The callback will also compute and log some basic statistics:

  • The mean of each activation layer
  • The standard deviation of each activation layer
  • The proportion of activations that are between -0.2 and 0.2

Let’s observe the behavior of the model using our new logging mechanism. We train the model for a few steps and let the hook collect the activations during the forward passes.

Using the idea presented in FastAI’s course, we will plot the evolution of activation statistics as well as histograms of layer activations. The horizontal axis represents training steps (one step corresponding to the processing of one minibatch), in chronological order from left to right. The vertical axis represents the histogram bins of activation values, from -7 to 7 (bottom to top). Brighter regions indicate larger histogram frequencies.

Layer monitoring output for default model.

We notice several potential issues at the beginning of the training phase. First, the variance of each layer is very close to zero, and there is a noticeable decrease between the first and the last layer. Also, almost all outputs are near zero, which can be seen both in the graph showing the proportion of small activations over time (last graph on the left) and in the activation histograms (all graphs on the right).

In addition to using LeakyReLU instead of ReLU, we will make two changes following FastAI’s recommendations. The goal is to bring the variance under control and hopefully widen the range of values produced by the activation function:

  1. Switch to Kaiming initialization
  2. Use some normalization (layer norm in our case)
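In PyTorch, these two changes might be sketched as below; the block structure is illustrative, and using GroupNorm with a single group is one way of getting layer-norm behaviour on convolutional feature maps:

```python
import torch
import torch.nn as nn


def kaiming_init(model: nn.Module):
    """Apply Kaiming initialization, matched to the LeakyReLU slope."""
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(m.weight, a=0.01, nonlinearity="leaky_relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)


def conv_block(in_channels, out_channels, **conv_kwargs):
    """Convolution followed by normalization and a LeakyReLU activation.

    GroupNorm with a single group normalizes over (C, H, W), which is
    equivalent to layer norm for image-shaped activations.
    """
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, **conv_kwargs),
        nn.GroupNorm(1, out_channels),
        nn.LeakyReLU(),
    )
```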

Check out the notebook to see the implementation details, especially the section about normalization. Logging the first few steps of training with the modified model produces the following results:

Layer monitoring output for customised model.

We notice that the variance is now much better behaved, hovering around 0.6. The proportion of activations close to zero is significantly lower, at ~65% instead of nearly 90% before. In particular, we can see that the model is now using the full range of the activation function.

The metrics associated with our model look much better now. But does this translate into better performance? In FastAI’s example, adding batch normalization was the most impactful factor in improving the results. To see whether we get similar benefits in our setup, we can try solving the more difficult variant of “defend the center” with different parameters: no normalization, layer normalization and batch normalization. In the plot below, each line is the average of 6 consecutive trials and the colored region denotes the standard error of the mean. Unfortunately, although our modified model exhibits nicer telemetry, the changes seem to have very little impact on training performance. We will keep them nonetheless, since we know they help with the fundamentals.

Playing against bots

We are now ready to train a model to play Doom deathmatch! All we need to add are some bots to compete with our agent.

A game of deathmatch consists (in our case) of a 2:30-minute session during which 8 bots and our agent fight each other to score the most frags. There are no teams. A “frag” corresponds to an enemy killed. Each player starts with a pistol and a couple of seconds of invincibility. As soon as a player dies, they respawn at one of the map’s respawn points, picked uniformly at random.

The actions available are the same as in the more difficult variant of “defend the center”. That is, the agent can move, turn and shoot according to the rules we defined earlier.

Deathmatch-friendly map

To play in deathmatch mode, we need a proper map with items like ammunition, health and respawn points. Using Slade, I created a simple map to train the agent on. I tried to design the map in such a way as to:

  • Be easy for the agent to navigate through; therefore, no intricate architecture.
  • Prevent the agent from playing effectively without moving; this means no open lines of fire across wide areas of the map.
  • Avoid an extravagant number of weapon and ammunition pick-up locations, so that obtaining items requires some baseline level of effort instead of luck from randomly running around.

Players can increase their firepower by picking up one of the four shotguns scattered across the map, which provides a significant advantage over the pistol. Ammunition and health kits are also available. The map is contained in deathmatch_simple_v1.wad, next to the corresponding deathmatch_simple_v1.cfg scenario config. The screenshot below shows an aerial view of the map.

Custom map created for the deathmatch environment.

If you are interested in modifying the map, or even creating your own, the following resources will be helpful. With the basic concepts in mind, you should be able to quickly create maps for your RL experiments; the tools are easier to use than one might expect.

  • Slade tutorial: Useful for a first approach of the tool but the contents are rather scant in my opinion.
  • Doom mapping tutorial: Much more comprehensive than the previous resource.
  • WAD file description: Useful for understanding how the resources are structured and how to interpret what Slade’s UI is showing you.
  • GZDoomBuilder: An advanced Doom map editor, much richer in features and less susceptible to bugs and crashes in my experience.

Modified environment

Setting up a deathmatch game requires a slightly different config for VizDoom as we need to tell it to host a game with a couple of multiplayer-specific parameters. VizDoom’s GitHub provides additional examples to set up multiplayer games. Here are the main parameters with their descriptions:
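Based on VizDoom’s multiplayer examples, the hosting configuration might be sketched as a string of game arguments; the specific flags and values below are my assumptions and should be adapted to your setup:

```python
def deathmatch_game_args(time_limit_min=2.5):
    """Build the multiplayer arguments for the hosted VizDoom instance."""
    return (
        f"-host 1 "               # this instance hosts the game
        f"-deathmatch "           # free-for-all instead of cooperative mode
        f"+timelimit {time_limit_min} "  # match duration in minutes
        f"+sv_forcerespawn 1 "    # dead players respawn automatically
        f"+sv_respawnprotect 1 "  # brief invulnerability after respawning
        f"+sv_spawnfarthest 1 "   # respawn as far from other players as possible
        f"+viz_respawn_delay 0"   # respawn immediately, no countdown
    )


# On a vizdoom.DoomGame instance, before game.init():
# game.add_game_args(deathmatch_game_args())
# game.add_game_args("+name AGENT +colorset 0")
```

The `+sv_respawnprotect` flag corresponds to the couple of seconds of invincibility mentioned earlier.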

We also need to make small adaptations to the Doom environment wrapper written in the first part of the series to account for the deathmatch mechanics. In particular, we need to keep track of the “frags” obtained by our agent, as they will constitute the basis for the rewards.

Indeed, the rewards here will be derived directly from the frags obtained by the agent: 1 frag = 1 point. Note that in the case of a suicide (which could happen if we added rocket launchers to the map, for example), the frag count decreases by one. Conveniently, this also works as a penalty discouraging the agent from such outcomes.

The code below shows the modified version of the step and reset methods. The main difference is that now we compute the rewards ourselves instead of collecting the values returned by VizDoom. Removing or adding the bots is done by sending commands to the Doom game instance via the appropriately named send_game_command method.
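A sketch of those two methods is shown below. The attributes self.game, self.actions, self.frame_skip and self._get_frame() are assumed to come from the part-1 wrapper, and the class name is mine:

```python
def frag_delta(current, previous):
    # A suicide decrements FRAGCOUNT, so the delta is automatically a penalty.
    return current - previous


class DeathmatchEnv:
    """Sketch of the modified step/reset methods for the deathmatch wrapper."""

    n_bots = 8

    def reset(self):
        self.game.new_episode()
        # Remove any leftover bots, then add fresh ones for the new episode.
        self.game.send_game_command("removebots")
        for _ in range(self.n_bots):
            self.game.send_game_command("addbot")
        self.last_frags = 0
        return self._get_frame()

    def step(self, action_idx):
        from vizdoom import GameVariable  # imported lazily for this sketch

        self.game.make_action(self.actions[action_idx], self.frame_skip)
        # Compute the reward ourselves from the change in frag count,
        # instead of using the reward returned by VizDoom.
        frags = self.game.get_game_variable(GameVariable.FRAGCOUNT)
        reward = frag_delta(frags, self.last_frags)
        self.last_frags = frags
        done = self.game.is_episode_finished()
        if not done and self.game.is_player_dead():
            self.game.respawn_player()
        return self._get_frame(), reward, done, {}
```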

With these changes, we are ready to start the learning process using the deathmatch environment. Take a look at the notebook to understand how everything is tied up together.

How does our agent perform in this new environment? Plotting the average reward over time depicts an unfortunate situation: even though there seems to be some improvement over time, the performance of our model is far from stellar. Even after 2 million steps, the agent barely reaches more than 2 frags per match on average.

Average reward (frags) as training progresses.

Compare that to the best bot, which manages to get around 13 frags by the end of each game, and our objective still seems pretty far away. You can use the following snippet to print the scores at the end of an episode.
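One way to do this is via VizDoom’s server state, which exposes the player names and frag counts; the helper below is a sketch and only the formatting part is testable without a live game:

```python
def format_scores(player_names, player_frags):
    """Return one line per player, sorted by increasing frag count."""
    ranked = sorted(zip(player_names, player_frags), key=lambda p: p[1])
    return "\n".join(f"- {name}: {frags} frags" for name, frags in ranked)


# With a live game (fields from VizDoom's ServerState):
# state = game.get_server_state()
# n = state.player_count
# print(format_scores(state.players_names[:n], state.players_frags[:n]))
```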

This would produce an output along the lines of:

- AGENT: 2 frags
- Conan: 3 frags
- Anderson: 3 frags
- MacGyver: 4 frags
- Blazkowicz: 5 frags
- Plissken: 6 frags
- Jones: 8 frags
- T800: 8 frags
- Machete: 11 frags


Notice that “defend the center” and the “deathmatch” environments are very similar: same inputs, same outputs. However, while we managed to solve the former in a reasonable amount of time, we barely improved at all in the latter. Despite all the similarities, the learning process did not go as smoothly as one might have expected. Why is that?

The issue lies in the fact that the rewards in the “deathmatch” scenario are very sparse. In other words, only a few combinations of state and action generate useful signals that our agent can use for learning. Indeed, to obtain a frag, the agent has to perform several steps “just right”; it has to navigate through the map, aim at enemies and repeatedly shoot them until eventually their health reaches zero. This means that it is quite unlikely that an agent performs by chance the sequence of actions required to reach a good outcome. Thus, with rare rewards, reinforcing the right behaviors will take a very long time.

A possible solution to this issue is to perform “reward shaping”. The idea is simple: give small positive rewards to actions that we believe will help the agent progress towards its main objective (increasing its frag count). For example, we can give rewards proportional to damage inflicted to enemies or proportional to the ammo and health collected.
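As a rough illustration of the idea, shaping can be expressed as a weighted sum of deltas of a few game variables; the variable names mirror VizDoom’s, and the weights here are placeholders (the actual values will be tuned in the next part):

```python
# Illustrative shaping weights -- smaller than the frag reward so the
# auxiliary signals guide learning without dominating the true objective.
SHAPING_WEIGHTS = {
    "FRAGCOUNT": 1.0,     # the true objective
    "DAMAGECOUNT": 0.01,  # damage inflicted on enemies
    "AMMO2": 0.02,        # ammunition collected
    "HEALTH": 0.02,       # health recovered
}


def shaped_reward(current_vars, previous_vars):
    """Combine the deltas of several game variables into a single reward."""
    return sum(
        weight * (current_vars[name] - previous_vars[name])
        for name, weight in SHAPING_WEIGHTS.items()
    )
```

In practice the deltas would likely need clipping or asymmetric weighting (losing health, for example, should not necessarily be penalized as strongly as gaining it is rewarded).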

Another option that can help with the learning process is to design some “learning curriculum”. Instead of directly trying to learn in a very challenging environment such as the one we have constructed, we can simplify the task early in the learning process. Then, difficulty is gradually increased. To implement this in our situation we could for example reduce the speed and health of enemies. This would make it easier for our agent to obtain positive rewards.

We will implement both ideas and even more in the next part of this series. This will allow us to vastly improve the learning performance and finally start winning some deathmatch games. Stay tuned!

Data Scientist at Swisscom, transforming network interactions into actionable insights.
