Recap

This is the last article in a three-part series on using deep reinforcement learning (RL) to teach an AI to play Doom. Over the course of the project, we saw how to apply state-of-the-art RL algorithms to a fun challenge like playing a first-person shooter. The emphasis was on using existing tools in practice rather than on reimplementing the latest algorithms.

The first part focused on establishing a basic setup for interacting with a Doom game instance. Thanks to a library called “ViZDoom”, we could unfold the game step by step, collecting game variables such as the screen buffer or the player status and feeding precise inputs such as button activations. With a bit of code, we adapted the game’s interface to integrate seamlessly with a popular set of RL algorithm implementations: stable-baselines. In the very first scenario we solved, the agent could only move laterally and had to shoot a static, passive enemy:

First scenario being solved by the agent.
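As a quick refresher, a stripped-down version of that adapter could look like the sketch below. It follows the classic gym API; names, shapes and defaults are illustrative rather than the exact code from the first article:

import gym
import numpy as np
import vizdoom as vzd

class DoomEnv(gym.Env):
    """Minimal ViZDoom-to-gym adapter (illustrative, not the code from part one)."""

    def __init__(self, config_path, frame_skip=4):
        super().__init__()
        self.game = vzd.DoomGame()
        self.game.load_config(config_path)   # scenario .cfg describing map, buttons, variables
        self.game.init()
        self.frame_skip = frame_skip
        n_buttons = self.game.get_available_buttons_size()
        self.action_space = gym.spaces.Discrete(n_buttons)
        h, w = self.game.get_screen_height(), self.game.get_screen_width()
        self.observation_space = gym.spaces.Box(0, 255, (h, w, 3), dtype=np.uint8)

    def reset(self):
        self.game.new_episode()
        return self._frame()

    def step(self, action):
        buttons = [0] * self.action_space.n
        buttons[action] = 1                  # one-hot button activation
        reward = self.game.make_action(buttons, self.frame_skip)
        done = self.game.is_episode_finished()
        return self._frame(), reward, done, {}

    def _frame(self):
        state = self.game.get_state()
        if state is None:                    # episode just ended: return a blank frame
            return np.zeros(self.observation_space.shape, dtype=np.uint8)
        frame = state.screen_buffer
        if frame.ndim == 3 and frame.shape[0] in (3, 4):
            frame = np.transpose(frame, (1, 2, 0))   # CHW -> HWC, depending on screen format
        return frame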

The second part expanded the initial setup to allow for more complex actions. It progressively introduced more difficult scenarios and showed how to closely monitor the model to ensure that the learning process was progressing smoothly.

Intermediate scenario, used for experiments and model comparisons.

Finally, we created a setup in which the agent could play against programmed bots, that is, non-human players whose actions are determined by a set of predefined rules. After a few million training steps, we started seeing some interesting behaviour from the agent. However, one substantial limitation of the setup was the scarcity of the reward signal. Indeed, in a naive approach where each enemy killed grants one reward point, obtaining that reward requires a lot of consecutive steps done just right. The agent has to chase and aim at moving enemies while repeatedly shooting them, all while avoiding obstacles like walls. Since reinforcement learning increases the likelihood of behaviours that result in positive feedback, the fact that rewards are only observed from time to time means that the learning process will be very slow.

The plot below shows the average score over six consecutive sessions of 3 million training steps. By the end of training, the agent manages on average around 4 frags per match. To put this into perspective, the best programmed bot usually attains around 10 frags in the same amount of time. This benchmark is represented by the red dashed line.

Average learning curve compared to typical programmed bot performance.

Fortunately, there are methods that can help with the sparse reward issue. In what follows, we will see how reward shaping and curriculum learning can drastically increase the performance of our agent. You can find a notebook with all the code to run the experiments on your machine. There are comments at regular intervals to help you follow along.

Reward shaping

To help with the issue of sparse signals, we can give our agent small positive rewards for every action we believe is beneficial to the learning process. For example, it is reasonable to think that inflicting damage on enemies will lead to a higher chance of getting a frag. Remember that unless the agent inflicts the full 100 points of damage needed for a kill, it will not see any immediate positive feedback. By introducing a small reward, we are therefore “pushing” the agent in the right direction during its learning process.

Similarly, we might think that picking up items lying around, such as medkits, will increase the probability of obtaining a good score. Again, it is very difficult for the agent to directly link the fact that an item was picked up to a higher total reward down the line (which can be thousands of steps away!). By generating some signal early on, we are helping it make that connection.

Here are a few examples of actions we could incentivise, together with the reward values used in the notebook’s implementation:

  • Frags
    1 point as usual.
  • Damage
    0.01 per damage point. Enemies typically have 100 health points and there is a one-to-one correspondence between health points and damage points.
  • Ammunition
    0.02 per unit picked up and -0.01 per unit used.
  • Health
    0.02 per health point gained and -0.01 per health point lost.
  • Movement
    0.00005 per time step if the agent has moved, otherwise a penalty of -0.0025.

I found those values to be useful in a deathmatch setting, but you can play around with the weights to emphasise certain actions more than others and see what impact this has on the agent.

Training with shaped rewards is actually quite simple. We just need to modify the usual environment wrapper and adjust the returned rewards whenever the step function is called. The notebook provides a walkthrough of the required code adaptations, and the sketch below illustrates the idea.
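As an illustration, here is what such a wrapper could look like. It assumes the classic gym API (obs, reward, done, info) and that the wrapped environment exposes the underlying DoomGame instance as env.game; the weights mirror the list above, the movement bonus is omitted for brevity, and the notebook’s actual implementation may differ in the details:

import gym
from vizdoom import GameVariable

# Weights for positive changes of each tracked game variable (see the list above).
POSITIVE_WEIGHTS = {
    GameVariable.FRAGCOUNT: 1.0,            # drop this entry if the base env already rewards frags
    GameVariable.DAMAGECOUNT: 0.01,
    GameVariable.SELECTED_WEAPON_AMMO: 0.02,
    GameVariable.HEALTH: 0.02,
}
# Penalties applied when a variable decreases (ammo used, health lost).
NEGATIVE_WEIGHTS = {
    GameVariable.SELECTED_WEAPON_AMMO: 0.01,
    GameVariable.HEALTH: 0.01,
}

class ShapedRewardWrapper(gym.Wrapper):
    """Adds small shaping rewards on top of the sparse frag-based reward."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.previous = {v: self.env.game.get_game_variable(v)
                         for v in POSITIVE_WEIGHTS}
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        for var, weight in POSITIVE_WEIGHTS.items():
            delta = self.env.game.get_game_variable(var) - self.previous[var]
            if delta > 0:
                reward += weight * delta
            elif delta < 0 and var in NEGATIVE_WEIGHTS:
                reward += NEGATIVE_WEIGHTS[var] * delta   # delta < 0, so this is a penalty
            self.previous[var] += delta
        return obs, reward, done, info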

The figure below shows the average number of frags obtained over six consecutive sessions of 3 million training steps. The effect on the learning process is immediate and truly stunning: just by tweaking the rewards a bit, we obtained a threefold improvement. In fact, the agent is already stronger than the best programmed bots! Note that we are plotting frags (the player score in a deathmatch) over time and not the reward observed by the agent. Of course, the latter would show an even stronger increase since there are now additional reward sources. When making such comparisons, make sure you are comparing the same quantities!

That’s not the end of it: reward shaping is not the only way to improve learning performance. Another intuitive concept is that of a learning curriculum.

Learning curriculum

With a learning curriculum, we start with a simplified version of the task and gradually increase the difficulty as the agent progresses. This idea mimics the learning process of humans; we always start with the basics and work our way towards more difficult tasks.

To implement this idea in our deathmatch environment, we will alter the speed and health of the bots based on the agent’s recent performance. At the beginning of the training session, we will make sure that bots are sluggish and very easy to take down. Then, as the agent becomes better at playing Doom, we gradually increase the speed and strength of the bots. To do so, we keep a rolling mean of the scores from the last evaluation episodes and adjust the difficulty according to certain thresholds. The multipliers used in the notebook are:

  • 10% if the average reward is ≤ 5
  • 20% if the average reward is ≤ 10
  • 40% if the average reward is ≤ 15
  • 60% if the average reward is ≤ 20
  • 80% if the average reward is ≤ 25
  • 100% if the average reward is > 25

A multiplier of 10% means that bots have 10% of their usual health and move at only 10% of their normal speed. Once the average reward (computed using the rolling mean) rises above 5, bots will have 20% of their health and speed, and so on.
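A minimal way to turn those thresholds into a difficulty level could look like the sketch below (the names are illustrative, not necessarily those of the notebook):

# Upper bounds of the rolling-mean reward for difficulty levels 0 to 4;
# anything above the last bound maps to level 5 (100% speed and health).
THRESHOLDS = [5, 10, 15, 20, 25]

def reward_to_level(mean_reward: float) -> int:
    for level, bound in enumerate(THRESHOLDS):
        if mean_reward <= bound:
            return level
    return len(THRESHOLDS)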

We can’t directly influence the behaviour of the bots from Python code. Instead, we need to use the scripting language offered by the ZDoom engine: ACS. Scripts are located next to the map data in .wad files; remember that you can read and edit those files using Slade. If you are interested in knowing more about ACS, ZDoom’s wiki has a detailed documentation section with more than enough material to get you up and running with the language.

The snippet below shows the few lines of code required to set up a basic curriculum environment. The ENTER and RESPAWN scripts are automatically called when a player joins the level or respawns after being killed, respectively. Both consist of a simple call to set_actor_skill, which adjusts the speed and health properties. The last script, change_difficulty, lets us change the difficulty setting and will be called from the Python program.

ACS Script:

#include "zcommon.acs"int difficulty_level = 5;
int speed_levels[6] = {0.1, 0.2, 0.4, 0.6, 0.8, 1.0};
int health_levels[6] = {10, 20, 40, 60, 80, 100};
script 2 ENTER
{
set_actor_skill(ActivatorTID());
}
script 3 RESPAWN
{
set_actor_skill(ActivatorTID());
}
script "change_difficulty" (int new_difficulty_level)
{
difficulty_level = new_difficulty_level;
}
function void set_actor_skill(int actor_id)
{
if (ClassifyActor(actor_id) & ACTOR_BOT ) {
SetActorProperty(
actor_id,
APROP_Speed,
speed_levels[difficulty_level]);
SetActorProperty(
actor_id,
APROP_Health ,
health_levels[difficulty_level]);
}
}

From Python, we can activate a script with the send_game_command helper function, which lets us execute any console command. The command that activates a script is oddly named pukename. It must be followed by the name of the script to be called and the parameters to be passed. For example, to set the difficulty level to 1 we would use:

game.send_game_command('pukename change_difficulty 1')

To integrate the curriculum into the learning process, we just need to keep track of the total reward across episodes. In the notebook, we implemented a simple rolling mean in the form of a fixed-length queue. At the end of each episode, we check whether the agent has crossed one of the predefined thresholds and call the change_difficulty script accordingly.
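Here is a sketch of what that end-of-episode hook could look like, assuming game is the ViZDoom DoomGame instance and reusing the reward_to_level helper sketched earlier (the notebook’s implementation may be organised differently):

from collections import deque

recent_scores = deque(maxlen=10)   # fixed-length queue acting as a rolling mean
current_level = 0

def on_episode_end(episode_score: float) -> None:
    """Call this at the end of every evaluation episode."""
    global current_level
    recent_scores.append(episode_score)
    level = reward_to_level(sum(recent_scores) / len(recent_scores))
    if level != current_level:
        current_level = level
        # pukename triggers the named ACS script with the given argument
        game.send_game_command(f'pukename change_difficulty {level}')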

Restarting the training process with this new configuration shows even better results than before! By combining reward shaping and curriculum learning, we obtained a fourfold increase in performance over the baseline for the same number of training steps. With only 3 million training steps, we are already outmatching the best programmed bots by a substantial margin. The only remaining task is to train for a longer period of time. Let’s see how high the limit is.

Training the final model

All the requirements to achieve a good deathmatch score are finally met, and we can start training the final model. To celebrate, I have created a fancier map that requires more effort from players to navigate and find enemies. The screenshot below shows an overhead view of the new map. You can find the corresponding .wad file in the GitHub repository.

Slightly more “advanced” map to put the AI to the test.

Architecture

The final model is based on our custom convolutional neural network from the previous article, with a couple of “size” changes (a rough code sketch follows the list):

  • 512 neurons for the first flat fully connected layer.
  • 256 neurons in a fully connected layer for the value net.
  • 256 neurons in a fully connected layer for the action net.
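Assuming stable-baselines3’s PPO, these sizes map onto policy_kwargs roughly as follows; the default NatureCNN feature extractor stands in here for the custom network of the previous article, so treat this as a sketch rather than the exact model definition:

from stable_baselines3 import PPO

# env: the wrapped Doom environment (see the wrapper sketches above).
policy_kwargs = dict(
    features_extractor_kwargs=dict(features_dim=512),   # flat fully connected layer
    net_arch=dict(pi=[256], vf=[256]),                   # action and value heads
    # older stable-baselines3 releases expect net_arch=[dict(pi=[256], vf=[256])]
)

model = PPO("CnnPolicy", env, policy_kwargs=policy_kwargs, verbose=1)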

In the repository, you might see different dimensions for the input layer. This is due to the use of frame stacking. However, I noticed no difference in performance with or without frame stacking. The illustration below shows the model architecture without the layer norms (see the previous article for how and why norms are applied).

Architecture of the model (layer norms omitted).

Training

To obtain the final model, I trained with reward shaping and curriculum learning as follows (a code sketch of this schedule follows the list):

  1. 10M training steps from scratch using a frame skip parameter of 4.
  2. 10M training steps using the previous result and setting the frame skip parameter to 2.
  3. 10M training steps using the previous result and setting the frame skip parameter to 1.
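Assuming stable-baselines3 and a hypothetical make_env(frame_skip=...) factory like the one used in the notebook (and reusing the policy_kwargs from the architecture sketch above), this schedule could be written as:

from stable_baselines3 import PPO

# Stage 1: train from scratch with a frame skip of 4.
model = PPO("CnnPolicy", make_env(frame_skip=4), policy_kwargs=policy_kwargs)
model.learn(total_timesteps=10_000_000)
model.save("doom_skip4")

# Stages 2 and 3: resume from the previous checkpoint with a smaller frame skip.
for previous, skip in ((4, 2), (2, 1)):
    model = PPO.load(f"doom_skip{previous}", env=make_env(frame_skip=skip))
    model.learn(total_timesteps=10_000_000)
    model.save(f"doom_skip{skip}")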

At the beginning of the training process, the frame skip is set relatively high to speed up learning. It is then progressively reduced to improve the aiming accuracy of the agent. The figure below shows the final “learning curves” for this three-step setup. Notice the sharp jump in performance as soon as we allow the agent to skip fewer frames.

Finally, it is important to note that the agent finds it harder to get frags on the new map due to its more complex structure. Therefore, the number of frags can’t be directly compared with those from the previous map and the plots above.

The best model reaches an average of 27 frags per 2:30-minute game. That is around one frag every 5.5 seconds! Here is an animation of the resulting agent in action, destroying the competition:

AI playing against 8 programmed bots.

Bonus: play against the agent!

Are you curious to see how your skills compare to the AI? There are two helper scripts in the git repository that will start a deathmatch game instance. The first one, demo_deathmatch.sh, starts a game of deathmatch with 8 bots and a pretrained agent. Use it if you want a demonstration of what the agent can do. The second one, demo_multiplayer.sh, will spawn two instances of Doom, one for a human player and one for the pretrained agent. Each player joins the same deathmatch game with 7 programmed bots. Good luck!

Potential improvements

If you have played a game of deathmatch against a fully trained agent, you will have noticed that it is actually quite hard to keep up in terms of score. However, when playing in a 1-versus-1 setting, it is surprisingly easy to defeat the AI due to its overall lack of strategy. Below are three pointers for further improving the model.

Memory

You might have noticed that the agent has no concept of memory. This means that enemies that are no longer visible on the screen are immediately forgotten. Also, the AI does not keep track of places it has already visited. This is not a big issue when playing against 8 programmed bots, as there is always an enemy close by. However, when playing against a single opponent, the agent will revisit the same location several times or simply ignore areas of the map it should have explored.

A potential improvement here would be to use a model that has a concept of memory, such as an LSTM neural network. The following paper shows that such a model can be used to play Doom effectively.

Lample, Guillaume, and Devendra Singh Chaplot. “Playing FPS Games with Deep Reinforcement Learning.” arXiv:1609.05521 [cs], Jan. 2018, http://arxiv.org/abs/1609.05521.
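If you are using stable-baselines3, one way to experiment with this, not covered in this series, is the recurrent PPO implementation from the sb3-contrib package; a minimal sketch, assuming env is the same wrapped Doom environment as before:

from sb3_contrib import RecurrentPPO

# Drop-in replacement for PPO with an LSTM on top of the CNN features,
# giving the agent a form of short-term memory.
model = RecurrentPPO("CnnLstmPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)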

Difficulty

If you have tried playing the game yourself, you might have noticed that the programmed bots are not the smartest of opponents. They often get stuck against walls or randomly run across the map. This also means that the challenge offered to the agent has its limits. It might be interesting to see whether we can keep improving the model’s skills by letting it play against versions of itself. Facing stronger opponents means the agent could potentially learn more interesting strategies.

Aggressiveness

The agent often prefers attacking over picking up strategic items or protecting itself. It is hard to point to a single cause, but it might be possible to mitigate this behaviour by choosing different weights for the reward shaping process or by defining entirely new actions to reinforce.

Conclusion

If you have read this far, thank you! I hope you found some value in following along with this series of articles. For me, it was quite a challenging project, as most of the RL concepts were new. The journey involved a gruelling amount of trial and error to find the right parameters and troubleshoot issues with the learning environments. Perhaps the most challenging part of all was staying highly organised across hundreds of different experiments, each with potentially different parameters and code versions. A good structure and extensive logging were vital to accurately compare outcomes and keep selecting the parameters that improved the agent. Nevertheless, all those efforts were worth it, as the final results are very pleasing. I definitely enjoyed learning an interesting new topic, and I look forward to seeing you again for another article sometime in the near future!
