DQN in crazy climber just cares about speed

Continuing the series of games that DQN and its descendants learn to play well: crazy climber. In this game, you climb a building by moving the joystick up, which lifts your arms, and pulling it back down, which lifts you up. You grab onto the ledges of open windows. The windows occasionally close, and if you are holding onto their ledge at the time, you fall.

DQN learns to play this game reasonably well in no time at all. This, I think, is because it receives a reward for every step up it climbs. Such an extremely dense reward suits Q-learning perfectly, and it just learns to climb very fast.
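To make that a bit more concrete, here is a minimal sketch (plain NumPy, not the Theano code actually used for training) of the one-step target that Q-learning regresses towards. With a reward arriving on nearly every step up, these targets carry useful signal from the very first updates, long before the bootstrapped Q-values mean anything:

    import numpy as np

    def q_learning_targets(rewards, next_q_values, terminals, gamma=0.99):
        # rewards:       (batch,) immediate rewards; in crazy climber these are
        #                non-zero on almost every step, so the targets are
        #                informative from the start of training
        # next_q_values: (batch, n_actions) Q-values for the next states
        # terminals:     (batch,) 1.0 where the episode ended, 0.0 otherwise
        return rewards + gamma * (1.0 - terminals) * next_q_values.max(axis=1)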

The problem is that it seems to get addicted to just climbing fast, and it ignores almost all of the rest of the game. It is so fast that windows very rarely close on it, so in situations where a human would stop and wait, it just runs through them and gets away with it most of the time.

By the end of the first epoch it already knows to just leg it up quickly, but it doesn’t know to move sideways when it reaches the top of a column and a neighbouring column still has more floors to climb; it only drifts across through random movements, like this:

Learning to move quickly to the open column takes it a couple more epochs.

Crazy climber learning curve

It later seems to learn to hold on to the ledge with both hands when it’s about to get hit by something, like an egg dropped by a bird. If you are hit while not holding on, you fall. It also learns to catch onto the helicopters at the end of each level for bonus points.

What it never really seems to learn is to wait for the windows to reopen, or to move sideways onto a clearer path. Even so, it gets to about 100,000 points on average thanks to the huge bonuses it gets for completing the levels quickly. That is higher than DeepMind’s human testers, who scored about 35,000 points. The world record is a surprisingly-low-to-me 219,000, which is probably due to the cartridge having been hard to find in its console days. (While I was writing this post, the APE-X paper was published claiming a score of around 320,000 in Crazy climber. That would beat both this network and the human record, and probably means it learnt some other tricks.)

The usual learning progress video:

You can download a trained version of the network from https://organicrobot.com/deepqrl/crazy_climber_20151018_e90_54374b2c86698bd8c71ca8d5936404340c0bea2d.pkl.gz . It was trained by running run_nature.py.

To use the trained network, you’ll need to get some old versions of the respective code repositories, because this was trained back in October 2015. Late November 2015 versions of Nathan Sprague’s deep Q-learning code (https://github.com/spragunr/deep_q_rl), or commit f8f2d600b45abefc097f6a0e2764edadb24aca53 on my branch (https://github.com/alito/deep_q_rl), together with Theano (https://github.com/Theano/Theano) at commit 9a811974b934243ed2ef5f9b465cde29a7d9a7c5, Lasagne (https://github.com/benanne/Lasagne.git) at commit b1bcd00fdf62f62aa44859ffdbd72c6c8848475c, and cuDNN 5.1, all work.
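I haven’t re-checked the exact loading incantation, but the .pkl.gz file is just a gzipped cPickle dump of the network object, so something along these lines should recover it (with the deep_q_rl checkout above on your Python path so the pickle can find its classes):

    import gzip
    import cPickle  # Python 2, to match the 2015-era code

    # path to wherever you saved the download
    path = 'crazy_climber_20151018_e90_54374b2c86698bd8c71ca8d5936404340c0bea2d.pkl.gz'

    with gzip.open(path, 'rb') as handle:
        network = cPickle.load(handle)

    # network is the pickled q-network object from deep_q_rl; check q_network.py
    # in the matching checkout for the methods it exposes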

DQN learns video pinball

Like Atlantis, Video pinball is another game where DQN and its derivatives do very well compared to humans, scoring tens of times higher than us. I looked at the game and trained an agent using Nathan Sprague’s code, which you can get from https://github.com/spragunr/deep_q_rl, to see what was happening.

In summary: patience was what was happening. The secret to the game, or at least to how DQN manages to score highly, is very frequent and well-timed bumping, which puts and keeps the ball in a vertical bounce pattern across one of the two point-gathering zones (the areas surrounded by the vertical bars at the left and right near the top of the screen), like this:

It’s not easy to tell but there’s a lot of nudging of the ball going on while it’s in flight.

Here’s the usual learning graph:

On the left is the average score per epoch, on the right the maximum score per epoch. For reference, the DeepMind papers tend to list human scores of around 20,000. As noted at the bottom, top human scores are actually far above this.

The progress video is not very interesting. The untrained network doesn’t know how to launch the ball, so there’s nothing to see. It then learns to launch the ball and perform almost constant but undirected bumps, which get it to around 20,000 points (around “human level”), before discovering its trick, after which the scores take off. There’s probably more noticeable progress along the way, related to recovery techniques for when the ball escapes, but it doesn’t feel worth trawling through hours of video to find them.

In summary, like in the case of Atlantis, I think this game is “solved”, and it shouldn’t be used for algorithm comparison in a quantitative way. The comparison could come down to a binary “does it discover the vertical trapping technique?”, in the same way Atlantis comes down to “does it discover that shooting those low, fast-flying planes is all that matters?”, and that’s all. The differences in score are probably more to do with the random seed used during the test run.

Unlike in Atlantis, these scores aren’t actually superhuman. TwinGalaxies reports a top score of around 38 million. You can watch this run in its full 5-hour glory here: http://www.twingalaxies.com/showthread.php/173072. This is still far, far above whatever DQN gets.

I’ve uploaded the best-performing network to http://organicrobot.com/deepqrl/video_pinball_20170129_e41_doubleq_dc48a1bab611c32fa7c78f098554f3b83fb5bb86.pkl.gz. You’ll need to get Theano from around January 2017 to run it (say commit 51ac3abd047e5ee4630fa6749f499064266ee164), since I trained this back then and they’ve changed their format since. I think I used the double Q algorithm, but I imagine it doesn’t make much difference whether you use that or standard DQN.
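For what it’s worth, the only difference between the two is how the bootstrap action gets chosen. A rough sketch of both targets (not the actual Theano code from the repository, and ignoring terminal states for brevity):

    import numpy as np

    def dqn_target(rewards, next_q_target, gamma=0.99):
        # standard DQN: the target network both picks and scores the next action
        return rewards + gamma * next_q_target.max(axis=1)

    def double_dqn_target(rewards, next_q_online, next_q_target, gamma=0.99):
        # double Q: the online network picks the action, the target network scores it
        best = next_q_online.argmax(axis=1)
        return rewards + gamma * next_q_target[np.arange(len(best)), best]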

Deep Q network trained to play Galaxian

I think DeepMind was a bit unlucky in not choosing Galaxian as one of their original set of games to train on. It’s a very popular game, much more interesting than its Space Invaders ancestor, and, more importantly, their algorithm learns great on it.

Here are the usual graphs for it:
Note that, unlike in the Space Invaders case, these scores are very much human-level. Scores in the high 20,000 range are really hard to get unless you play the game a lot. Even the 10,000 average is very respectable. It also reaches perfectly reasonable and quite human-styled levels of play very quickly (in less than 10 epochs).

I made a learning-progression video for it:

I used run_nature.py from Nathan Sprague’s code (https://github.com/spragunr/deep_q_rl). You’ll need a newish version of ALE (https://github.com/mgbellemare/Arcade-Learning-Environment) to train this since code to support Galaxian was only incorporated recently (commit c62378aece70cfb119d0d4111d15b4b830a23d6c or later).
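Before kicking off a long training run, it’s worth checking that your ALE build actually knows about Galaxian. Something like this should do it (sketched from memory of the ale_python_interface that deep_q_rl uses; adjust the ROM path to wherever your copy lives):

    from ale_python_interface import ALEInterface

    ale = ALEInterface()
    ale.loadROM('galaxian.bin')  # older ALE builds will complain that the ROM is unsupported
    print(ale.getMinimalActionSet())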

Truly superhuman results with DeepMind’s DeepQ algorithm

While DeepMind’s papers show their algorithm doing better than some random person playing a game they don’t play much, on a keyboard, for five minutes (“expert human player” scoring 31 on Breakout, c’mon), they don’t try to show their programs getting better scores than human experts at their own games. I am not really sure why they chose the 5-minute limit for reporting, but the rest of us don’t have to abide by it, and here we can show that if you remove it, the algorithm can beat the best human scores in at least one game.

One of the games that stands near the top of the DeepQ / human ratio in their papers is Atlantis. It’s a game where planes fly overhead and you shoot them down from one of three guns. Most planes that fly past are harmless and are only there to score points from. Every now and then a plane comes across firing a laser downwards, and if you fail to shoot it down it will destroy one of your guns or one of your other “buildings”. When all of your buildings are destroyed, the game is over. One unusual feature of the game is that your buildings and guns reappear if you score enough points.

It would be very easy to write a program to play this game almost perfectly if you could hand-code it. The sprites are pretty much always the same, the planes always appear at the same height, and the game never changes. You could write an if-detect-plane-here-then-shoot program and it would do amazingly well. It’s a lot harder when, like DeepQ, you hand-code almost nothing.
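For the sake of illustration, here’s the kind of thing I mean. All the constants below are made up (I haven’t gone and measured where the planes fly or what colour the sky is), but a hand-coded player wouldn’t need to be much more complicated than this:

    import numpy as np

    # made-up constants: the rows the attacking planes fly along, the sky colour,
    # and the horizontal position of each of the three guns on the 160-pixel-wide screen
    PLANE_ROWS = slice(20, 30)
    SKY_COLOUR = np.array([0, 0, 0])
    GUN_COLUMNS = {'left': 40, 'centre': 80, 'right': 120}

    def choose_gun(frame):
        # frame: (210, 160, 3) RGB screen grab from the emulator
        strip = frame[PLANE_ROWS]
        # a plane is just a clump of non-sky pixels at the height the planes fly at
        plane_columns = np.where((strip != SKY_COLOUR).any(axis=2).any(axis=0))[0]
        if len(plane_columns) == 0:
            return None  # nothing worth shooting at
        plane_x = plane_columns.mean()
        # fire the gun closest to the plane
        return min(GUN_COLUMNS, key=lambda gun: abs(GUN_COLUMNS[gun] - plane_x))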

DeepQ learns to play this game relatively quickly. There is constant feedback from shooting down planes, and because of the loss-of-life marker, the program gets instant feedback when one of its buildings or guns is destroyed. Here’s the training graph for it.

And here are the first 60 epochs or so, since the later epochs hide the early landscape:

As you can see, the learning is quite gradual until the huge spikes. There is even quite a bit of improvement before epoch 30 that is hidden in the zoomed-in image, although given the constant feedback I’d have expected it to improve even faster. The scores are very high from epoch 40 onwards. For reference, the scores quoted in DeepMind’s Nature paper (http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) are around 30,000 for humans and 85,000 for DQN.

The spikes on the first graph are huge, but they turn out to be artificially LOW because of the limit on playing time imposed on the testing period during training (125,000 actions, which is roughly two hours twenty minutes of play). I grabbed the model that produced the highest spike, from epoch 78, and set it to play with unlimited time. After over 20 hours of gameplay, with the emulator running very slowly (a bug yet to be investigated), and me doubting it was ever going to finish the game, I killed the run.
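(For anyone checking that back-of-the-envelope number: it assumes the standard DQN frame skip of 4 and the Atari’s 60 frames per second.)

    actions = 125000
    frame_skip = 4            # each chosen action is repeated for 4 frames, the usual DQN setting
    frames_per_second = 60.0  # Atari 2600 frame rate

    hours = actions * frame_skip / frames_per_second / 3600
    print(hours)  # ~2.3, i.e. roughly two hours twenty minutes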

I’ve put the first 9 and a quarter hours of the game on Youtube:

It scores around 15 million points during the video, and I think over 30 million during the run (it isn’t easy to check since the score clocks over every million, so you have to find all the clock-overs during the run). For comparison, the highest score recorded in TwinGalaxies is around 6.6 million for someone using a real Atari, and around 10 million for someone using an emulator.

If you watch the game even for a little bit you can tell that the computer doesn’t play very well at all. It misses most of its shots. What it does superhumanly, though, is kill the laser-shooting planes, and since those are the only way you can die, it can play for a very long time.

For those of you who want to try this out, I used Nathan Sprague’s brilliant reimplementation of DeepMind’s DeepQ algorithm, which you can get at https://github.com/spragunr/deep_q_rl. I ran it with the “Nature”-sized model. You can also download the pre-trained model from http://organicrobot.com/deepqrl/atlantis_20151122_e78_94f7d724f69291ecfe23ac656aef38c63a776b8d.pkl.gz