DQN in Crazy Climber just cares about speed

Continuing the series of games that DQN and its descendants learn to play well: Crazy Climber. In this game, you climb a building by pushing the joystick up, which raises your arms, and pulling it back down, which pulls you up. You hold on to the ledges of open windows. The windows occasionally close, and if you are holding on to a ledge at the time, you fall.

DQN learns to play this game reasonably well in no time at all. This, I think, is because it receives a reward for every step it climbs. This extremely dense reward signal suits Q-learning perfectly, and it learns to climb extremely fast.
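
For reference, this is the one-step target that Q-learning regresses towards; with a reward arriving for almost every action, almost every update carries a useful signal. This is just a generic sketch, with gamma as a typical placeholder value rather than the one used for this particular run.

# One-step Q-learning target: r + gamma * max_a' Q(s', a').
# gamma = 0.99 is a typical value, not necessarily the one used here.
def q_target(reward, next_q_values, gamma=0.99, terminal=False):
    if terminal:
        return reward  # no bootstrapping past the end of an episode
    return reward + gamma * max(next_q_values)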

The problem is that it seems to get addicted to just climbing fast, and it ignores almost all of the rest of the game. It is so fast that windows very rarely close on it, so in situations where a human would stop and wait, it just runs through them and gets away with it most of the time.

By the end of the first epoch it already knows to leg it up quickly, but it doesn’t know to move sideways once it reaches the top of a column, onto one with more floors left to climb; it only drifts sideways through random movements, like this:

Learning to move quickly to the open column takes it a couple more epochs.

Crazy climber learning curve

It later seems to learn to hold on to the ledge with both hands when it’s going to get hit by something, like an egg dropped by a bird. If you are not holding on, you fall. It also learns to catch onto the helicopters at the end of each level for bonus points.

What it never really seems to learn is to wait for the windows to reopen, or to move sideways onto a clearer path. This still lets it get to about 100,000 points on average due to the huge bonuses it gets for completing the levels quickly. This is higher than the score of DeepMind’s human testers, which is about 35,000 points. The world record is a surprisingly-low-to-me 219,900, which is probably due to the cartridge having been hard to find in its console days. (While I was writing this post, the APE-X paper was published claiming a score of around 320,000 in Crazy Climber. This would beat the human record, and it probably means it learnt other tricks.)

The usual learning progress video:

You can download a trained version of the network from https://organicrobot.com/deepqrl/crazy_climber_20151018_e90_54374b2c86698bd8c71ca8d5936404340c0bea2d.pkl.gz . It was trained by running run_nature.py.

To use the trained network, you’ll need some old versions of the respective code repositories, because this was trained back in October 2015. Late-November-2015 versions of Nathan Sprague’s deep Q-learning code (https://github.com/spragunr/deep_q_rl), or commit f8f2d600b45abefc097f6a0e2764edadb24aca53 of my branch (https://github.com/alito/deep_q_rl), together with Theano (https://github.com/Theano/Theano) at commit 9a811974b934243ed2ef5f9b465cde29a7d9a7c5, Lasagne (https://github.com/benanne/Lasagne.git) at commit b1bcd00fdf62f62aa44859ffdbd72c6c8848475c, and cuDNN 5.1, work.

Comparison of human scores and “human scores” in Atari

Below is a list of the different scores reported for humans on Atari games in the reinforcement learning community, together with the highest scores reported on TwinGalaxies. The conditions under which these scores are achieved are quite different: the RL community tests humans on an emulator using a keyboard, games can only last up to 5 minutes, and there is no sound (i.e. they replicate the conditions that the algorithm experiences). It also reports average values, not the maximum obtained. TwinGalaxies, on the other hand, is a site that collects records, so it is only interested in maximum values. The differences in the scores tend to be very big, often more than 100-fold.

I think the maximum values achieved are important benchmarks, reflecting the levels that can be reached by a human brain finely trained on a task. It’s also quite apparent that, even though the reported records are maxima, the average score of the people who achieved these records would not be anywhere near as low as the averages reported for humans in the RL literature. The five-minute limit will become a problem when trying to beat these top scores, but that limit can be removed; it doesn’t tend to be a bottleneck at the moment since the agents can rarely survive for five minutes anyway.

In the following table, “DNADRL human” refers to the human scores reported in “Dueling Network Architectures for Deep Reinforcement Learning” by Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot and Nando de Freitas https://arxiv.org/abs/1511.06581. Similar scores are seen in other papers in the literature. TwinGalaxies (any) is the highest score for that game with difficulty setting B achieved in either a console or an emulator, although they almost completely consist of console scores. Factor is just the TwinGalaxies score divided by the DNADRL score.

Game DNADRL human TwinGalaxies (any) Factor Link
Alien 7,127.7 255,265 35.8 http://www.twingalaxies.com/scores.php?scores=2316
Amidar 1,719.5 104,159 60.6 http://www.twingalaxies.com/showthread.php/140175
Assault 742.0 8,647 11.7 http://www.twingalaxies.com/scores.php?platformid=5&gamename=Assault
Asterix 8,503.3 1,000,000 117.6 http://www.twingalaxies.com/showthread.php/150583
Asteroids 47,388.7 10,004,100 211.1 http://www.twingalaxies.com/scores.php?scores=2315
Atlantis 29,028.1 10,604,840 365.3 http://www.twingalaxies.com/scores.php?platformid=5&gamename=Atlantis
Bank Heist 753.1 82,058 109.0 http://www.twingalaxies.com/scores.php?platformid=5&gamename=Bank%20Heist
Battle Zone 37,187.5 1,545,000 41.5 http://www.twingalaxies.com/scores.php?scores=2737
Beam Rider 16,926.5 999,999 59.1 http://www.twingalaxies.com/scores.php?scores=2738
Berzerk 2,630.4 1,057,940 402.2 http://www.twingalaxies.com/showthread.php/149853
Bowling 160.7 300 1.9 http://www.twingalaxies.com/showthread.php/159727
Boxing 12.1 01:42.0 http://www.twingalaxies.com/scores.php?scores=740
(time remaining after getting to 100 is the measurement of choice)
Breakout 30.5 864 28.3 http://www.twingalaxies.com/scores.php?scores=5389
(864 is the maximum score)
Centipede 12,017.0 1,301,709 108.3 http://www.twingalaxies.com/showthread.php/167433
Chopper Command 7,387.8 999,999 135.4 http://www.twingalaxies.com/showthread.php/140559
Crazy Climber 35,829.4 219,900 6.1 http://www.twingalaxies.com/showthread.php/167082
Defender 18,688.9 5,443,150 291.3 http://www.twingalaxies.com/scores.php?scores=2296
Demon Attack 1,971.0 1,556,345 789.6 http://www.twingalaxies.com/scores.php?scores=2295
Double Dunk -16.4
Enduro 860.5 118,000.00 km 137.1 http://www.twingalaxies.com/scores.php?scores=2106
(Note: different scoring, and top score is 30 times the second!)
Fishing Derby -38.7 71 -1.8 http://www.twingalaxies.com/scores.php?scores=13743
Freeway 29.6 38 1.3 http://www.twingalaxies.com/showthread.php/145169
Frostbite 4,334.7 1,832,730 422.8 http://www.twingalaxies.com/scores.php?scores=3770
Gopher 2,412.5 829,440 343.8 http://www.twingalaxies.com/scores.php?scores=3775
Gravitar 3,351.4 999,950 298.4 http://www.twingalaxies.com/scores.php?scores=3777
H.E.R.O. 30,826.4 1,000,000 32.4 http://www.twingalaxies.com/showthread.php/157083
Ice Hockey 0.9 36 40.0 http://www.twingalaxies.com/showthread.php/167393
James Bond 302.8 45,550 150.4 http://www.twingalaxies.com/scores.php?scores=13897
Kangaroo 3,035.0 1,424,600 469.4 http://www.twingalaxies.com/scores.php?scores=3788
Krull 2,665.5 1,245,900 467.4 http://www.twingalaxies.com/scores.php?scores=3793
Kung-Fu Master 22,736.3 1,000,000 44.0 http://www.twingalaxies.com/scores.php?scores=3794
Montezuma’s Revenge 4,753.3 1,219,200 256.5 http://www.twingalaxies.com/scores.php?scores=3808
Ms. Pac-Man 6,951.6 2,654,680 381.9 http://www.twingalaxies.com/scores.php?scores=3816
Name This Game 8,049.0 25,220 3.1 http://www.twingalaxies.com/scores.php?scores=3817
Phoenix 7,242.6 4,014,440 554.3 http://www.twingalaxies.com/scores.php?scores=3829
Pitfall! 6,463.7 114,000 17.6 http://www.twingalaxies.com/showthread.php/146051
Pong 14.6 Understandably, no records are kept
Private Eye 69,571.3 118,000 1.7 Game on difficulty A is what is tracked
Q*Bert 13,455.0 2,400,000 178.4 http://www.twingalaxies.com/showthread.php/164665
River Raid 17,118.0 1,000,000 58.4 http://www.twingalaxies.com/scores.php?scores=2133
Road Runner 7,845.0 2,038,100 259.8 http://www.twingalaxies.com/scores.php?scores=5152
Robotank 11.9 76 6.4 http://www.twingalaxies.com/scores.php?scores=5270
Seaquest 42,054.7 999,999 23.8 http://www.twingalaxies.com/scores.php?scores=2136
Skiing -4,336.9 -3272 0.8 http://www.twingalaxies.com/showthread.php/141803
I think I’m translating the score right
Solaris 12,326.7 57,840 4.7 http://www.twingalaxies.com/showthread.php/140250
Space Invaders 1,668.7 621,535 372.5 http://www.twingalaxies.com/scores.php?scores=2172
Stargunner 10,250.0 69,400 6.8 http://www.twingalaxies.com/scores.php?scores=13927
Surround 6.5
Tennis -8.3
Time Pilot 5,229.2 112,100 21.4 http://www.twingalaxies.com/scores.php?scores=2530
Tutankham 167.6
Up and Down 11,693.2 75,230 6.4 http://www.twingalaxies.com/scores.php?platformid=5&gamename=Up%20%27N%20Down
Venture 1,187.5 913,200 769.0 http://www.twingalaxies.com/scores.php?scores=2223
Video Pinball 17,667.9 35,197,952 1992.2 http://www.twingalaxies.com/scores.php?scores=3032
Wizard Of Wor 4,756.5 864,500 181.8 http://www.twingalaxies.com/scores.php?scores=3035
Yars’ Revenge 54,576.9 15,000,105 274.8 http://www.twingalaxies.com/showthread.php/161781
(maybe different game)
Zaxxon 9,173.3 772,400 84.2 http://www.twingalaxies.com/scores.php?scores=3039

DQN learns Video Pinball

Like Atlantis, Video Pinball is another game where DQN and its derivatives do very well compared to humans, scoring tens of times higher than us. To see what was happening, I looked at the game and trained an agent using Nathan Sprague’s code, which you can get from https://github.com/spragunr/deep_q_rl.

In summary: patience was what was happening. The secret to the game, or at least to how DQN manages to score so highly, is very frequent and well-timed bumping, which puts and keeps the ball in a vertical bounce pattern across one of the two point-gathering zones that you can see surrounded by the vertical bars at the left and right near the top of the screen, like this:

It’s not easy to tell but there’s a lot of nudging of the ball going on while it’s in flight.

Here’s the usual learning graph:

On the left is the average score per epoch, on the right the maximum score per epoch. For reference, the DeepMind papers tend to list human scores of around 20,000. As noted at the bottom, top human scores are actually far above this.

The progress video is not very interesting. The untrained network doesn’t know how to launch the ball, so there’s nothing to see. It then learns to launch the ball and perform almost constant but undirected bumps, which get it to around 20,000 points (around “human level”), before discovering its trick, after which the scores take off. There’s probably more noticeable progress along the way related to recovery techniques for when the ball escapes, but it doesn’t feel worth trawling through hours of video to find it.

In summary, like in the case of Atlantis, I think this game is “solved”, and it shouldn’t be used for algorithm comparison in a quantitative way. It could just be a binary “does it discover the vertical trapping technique?” like Atlantis is about “does it discover that shooting those low fast-flying planes is all important?” and that’s all. The differences in score are probably more to do with the random seed used during the test run.

Unlike in Atlantis, these scores aren’t actually superhuman. TwinGalaxies reports a top score of around 38 million. You can watch this run in its full 5-hour glory here: http://www.twingalaxies.com/showthread.php/173072. This is still far, far above whatever DQN gets.

I’ve uploaded the best-performing network to http://organicrobot.com/deepqrl/video_pinball_20170129_e41_doubleq_dc48a1bab611c32fa7c78f098554f3b83fb5bb86.pkl.gz. You’ll need to get Theano from around January 2017 to run it (say commit 51ac3abd047e5ee4630fa6749f499064266ee164) since I trained it back then and they’ve changed their format since. I think I used the double-Q algorithm, but I imagine it doesn’t make much difference whether you use that or standard DQN.
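
For those wondering what the double-Q change amounts to, here is a minimal sketch of the two targets. The gamma value and the separate online/target networks are generic placeholders rather than the exact setup used for this run.

# Standard DQN target: the target network both picks and evaluates the action.
def dqn_target(reward, next_q_target, gamma=0.99):
    return reward + gamma * max(next_q_target)

# Double-Q target (van Hasselt et al.): the online network picks the action,
# the target network evaluates it, which reduces the overestimation bias.
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]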

Telling Python to prefer your user’s private directory over the global library directory

Just a quick note since this caused quite a bit of frustration for me over the last couple of days:

I usually install Python modules using ‘pip install --user <module>’ since this is nice and safe, and I don’t need sudo to do it. These modules are installed, by default, to ${HOME}/.local/lib/python<version>/site-packages. This has served me well over the years, and it’s something that worked before pip, going back to easy_install and even earlier, when running ‘python setup.py install --user’ manually.

Recently, while trying to debug problems I was having installing TensorFlow, I noticed that the global packages were taking priority over my user-local modules in cases where a module was installed in both. For example:
    >>> import six
    >>> six.__file__
    /usr/lib/python2.7/dist-packages/six.pyc
even though I have six installed locally. (Note that /usr/lib/python2.7/dist-packages is the Debian-specific global location for Python modules. I don’t think the fact that I am using a Debian derivative changes this story, but I am not completely sure.)

After a lot of Googling and looking through site.py, this ended up being due to:

  1. Having ENABLE_USER_SITE = None in /usr/lib/python2.7/site.py, although I don’t remember ever having changed this, and it used to work OK. Anyway, set this to True.
  2. More importantly, there was a line in ${HOME}/.local/lib/python2.7/easy-install.pth saying ‘/usr/lib/python2.7/dist-packages’. Directories listed there get prepended to sys.path, so this sneaks the global path above my user-local ones. Removing this line fixes the problem.

/usr/lib/python2.7/dist-packages gets added to sys.path anyway later in site.py, but it gets appended instead of prepended, so everything works out.
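
A quick way to check that the ordering is now right, and which copy of a module actually wins; this is just a diagnostic sketch using the six example from above.

# Show which copy of six wins and how sys.path is ordered.
import sys
import six

print(six.__file__)   # should now point at ~/.local/..., not dist-packages
for path in sys.path:
    print(path)       # user-local directories should appear before dist-packages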

Deep Q network trained to play Galaxian

I think DeepMind was a bit unlucky in not choosing Galaxian as one of their original set of games to train on. It’s a very popular game, much more interesting than its Space Invaders ancestor, and, more importantly, their algorithm learns great on it.

Here are the usual graphs for it:
Note that, unlike in the Space Invaders case, these scores are very much human-level. Scores in the high 20,000s are really hard to get unless you play the game a lot, and even the 10,000 average is very respectable. It also reaches perfectly reasonable and quite human-looking levels of play very quickly (in fewer than 10 epochs).

I made a learning-progression video for it:

I used run_nature.py from Nathan Sprague’s code (https://github.com/spragunr/deep_q_rl). You’ll need a newish version of ALE (https://github.com/mgbellemare/Arcade-Learning-Environment) to train this since code to support Galaxian was only incorporated recently (commit c62378aece70cfb119d0d4111d15b4b830a23d6c or later).

Truly superhuman results with DeepMind’s DeepQ algorithm

While DeepMind’s papers show their algorithm doing better than some random person on a game they don’t play much, while having to use a keyboard and playing for only five minutes (“expert human player” scoring 31 on Breakout, c’mon), they don’t try to show their programs getting better scores than human experts at their own games. I am not really sure why they chose the 5-minute limit for reporting, but the rest of us don’t have to abide by it, and here we can show that if you remove the limit, the algorithm can beat the best human scores in at least one game.

One of the games that stands near the top of the DeepQ / human ratio in their papers is Atlantis. It’s a game where planes fly overhead and you shoot them down from one of three guns. Most planes that fly past are harmless and are only there to score points from. Every now and then a plane firing a laser down comes across and if you fail to shoot it down it will destroy one of your guns or one of your other “buildings”. When all of your buildings are destroyed, the game is over. One unusual feature of the game is that your buildings and guns reappear if you score enough points.

It would be very easy to write a program to play this game almost perfectly if you could hand-code it. The sprites are pretty much always the same, the planes always appear at the same height and the game never changes. You could write an if-detect-plane-here-then-shoot program and it would do amazingly well. It’s a lot harder when, like DeepQ, you hand-code almost nothing.

DeepQ learns to play this game relatively quickly. There is constant feedback from shooting down planes, and because of the loss-of-life marker, the program gets instant feedback when one of its buildings or guns is destroyed. Here’s the training graph for it.

And here’s the first 60 epochs or so since the latter epochs are hiding the early landscape:

As you can see, the learning is quite gradual until the huge spikes. There is even quite a bit of improvement before epoch 30 that is hidden in the zoomed-in image, although given the constant feedback I’d have expected it to improve even faster. The scores are very high from epoch 40 onwards. For reference, the scores quoted in DeepMind’s Nature paper (http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) are around 30,000 for humans and 85,000 for DQN.

The spikes on the first graph are huge, but they turn out to be artificially LOW because of the playing-time limit imposed on the testing period during training (125,000 actions, which is roughly two hours and twenty minutes of play). I grabbed the model that produced the highest one, on epoch 78, and set it to play with unlimited time. After over 20 hours of gameplay, with the emulator running very slowly (a bug yet to be investigated), and me doubting it was ever going to finish the game, I killed the run.
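
As a rough check on that playing-time figure, assuming the usual DQN frame-skip of 4 and the Atari's 60 frames per second (both assumptions about the setup, not numbers quoted above):

# 125,000 actions, each repeated for 4 frames, at 60 frames per second
actions = 125000
frame_skip = 4
fps = 60.0
print(actions * frame_skip / fps / 3600)   # ~2.3 hours, i.e. roughly two hours twenty minutes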

I’ve put the first 9 and a quarter hours of the game on Youtube:

It scores around 15 million points during the video, and I think over 30 million during the run (it isn’t easy to check since the score wraps around every million points, so you have to find all of the wrap-arounds during the run). For comparison, the highest score recorded on TwinGalaxies is around 6.6 million for someone using a real Atari, and around 10 million for someone using an emulator.

If you watch the game even for a little bit you can tell that the computer doesn’t play very well at all. It misses most of its shots. What it does super-humanly, though, is kill the laser-shooting planes, and since those are the only way you can die, it can play for a very long time.

For those of you who want to try this out, I used Nathan Sprague’s brilliant reimplementation of DeepMind’s DeepQ algorithm which you can get at https://github.com/spragunr/deep_q_rl. I ran it with the “Nature”-sized model. You can also download the pre-trained model from http://organicrobot.com/deepqrl/atlantis_20151122_e78_94f7d724f69291ecfe23ac656aef38c63a776b8d.pkl.gz

DeepMind’s Atari paper replicated

This is a follow-up to the previous post. Most of the results have been replicated. Turning off Nesterov momentum, raising the RMSprop decay to 0.99 and setting the discount rate to 0.95 did the trick. The spikes in the Q-value plot that I mentioned in the last post, observed with the previous parameters, seem to have been caused by the momentum making the optimisation diverge, not by Q-learning itself diverging as I thought at the time.

Just to be completely clear, none of this is by me: all the learning code, which I will refer to as deep_q_rl, is by Nathan Sprague, and the algorithm is from the paper “Playing Atari with Deep Reinforcement Learning” by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller, which you can get from http://arxiv.org/abs/1312.5602. You can get the code from https://github.com/spragunr/deep_q_rl. I’m mainly acting as a tester.

(As I was writing this post (it has taken me almost two weeks to write), DeepMind published their updated paper in Nature (http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html), and this time they’ve released their source code: https://sites.google.com/a/deepmind.com/dqn/. In this post I am mainly comparing against the original NIPS workshop paper. I will only mention the Nature paper where it clarifies something about the NIPS paper.)

When I write that the results are replicated, I mean that I think the results produced by deep_q_rl are probably as good as the results DeepMind obtained when they wrote the paper, and that all the original results seem believable in light of these. The Q-value curves still look very different, most likely because they were using a higher discount rate (in their Nature paper they mention a discount rate of 0.99, while this code uses 0.95). Also, depending on how you interpret the table where they list their results, DeepMind’s results might have been significantly better. It is hard to tell exactly what they mean by “average total reward” in table 1. Do they mean across the whole 200 epochs? Or testing their best model? Or testing only the final network? Or across a large swathe of the networks that were trained? I’ve taken a liberal interpretation of it: roughly, how well the agent scores across many epochs once it looks like it has learnt. (From their Nature paper, it seems that they mix their use of “episode” and “epoch” a bit, so “best” seems to mean the average across the testing epoch where the agent averaged highest, while “average” could be the average over all testing episodes.)

Here are DeepMind’s average and best scores as listed in table 1:

Game            DeepMind average   DeepMind best
Beam Rider      4092               5184
Breakout        168                225
Enduro          470                661
Pong            20                 21
Q*Bert          1952               4500
Seaquest        1705               1740
Space Invaders  581                1075

I think deep_q_rl does clearly worse than DeepMind’s code in Space Invaders, a little bit worse in Breakout, similar in Pong and Enduro, better on Q*Bert, and worse, similar or much better on Seaquest depending on your angle. I did not test on Beam rider.

I’ll just post the average and max-score curves obtained by deep_q_rl, some qualitative commentary on how the agent plays each game, and videos showing good (biased) examples of the agent playing, or montages of it learning. The graphs show the average test score on the left and the highest test score on the right; the x-axis is the epoch. They were obtained using a test epsilon of 0.05 (i.e. on each frame there is a five percent chance of performing a random action), just to be able to compare them with DeepMind’s results. I prefer watching the results of epsilon 0.01 (i.e. a 1 percent chance of a random action), since it makes it easier to tell policy problems from random errors, so I’ll be describing the behaviour of runs using those settings. Also, let me be very clear that even though I might have some negative comments on the behaviour of the agent, this is all relative to the absolute level of blown-awayedness that I feel about the fact that this works at all.
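
For concreteness, this is the epsilon-greedy rule those epsilons refer to, as a minimal sketch; the q_values vector would come from the network, and everything here is a placeholder rather than code from deep_q_rl.

import random

def choose_action(q_values, epsilon=0.05):
    # With probability epsilon take a random action, otherwise the greedy one
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])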

Pong

Pong is very simple and the agent learns very quickly. It plays extremely aggressively always trying to win in one hit. It mostly wins 21-0 or 21-1.

Breakout

The code learns to play breakout relatively quickly, taking about 40 epochs (1 epoch = 50000 frames) to get reasonably good at it. It plays without any seeming purpose, hitting the ball when it comes down but not aiming anywhere.
At some point it gets good enough to make a tunnel in the wall, just by repeated random hitting, and put the ball through the tunnel, scoring lots of points. It never learns to hit the ball after it comes back down through the tunnel. I think that the image looks different enough that the network does not know what to do. It mostly just sits catatonic with the paddle against the left or right wall. I have never seen it clear a level.
The difference between its high scores and its average scores once it has learnt comes down to luck: first, getting the ball through the tunnel quickly once it’s formed (since once the tunnel is formed its play deteriorates), and second, either having the ball hang around above the wall for longer, or having the paddle in the right place by accident once the ball does come back down.

Highest score observed: 391. Obtained just as described above: random hitting leads to a hole in the middle of the wall, the ball goes up and hangs around for a while, and the hole is in the right place for a new ball to go through it when hit by a paddle lying against the left wall.

The paper lists the average human score as 31, which I can only explain if they were using a keyboard to play the game. This game is meant to be played with a paddle, and 31 would be a pathetic score. I don’t think the scores attained by the network are better than human level.

Space Invaders

I expected the code to learn Space Invaders quite well, since it is a very simple game where the rewards come very shortly after the actions, and where random thrashing at the beginning can quickly show which actions lead to good rewards, but it does not do very well at it, at least compared to a human.
It does learn very quickly that shooting is a good thing and within 5 epochs it gets reasonably good at shooting down invaders. It just never gets very good at evading invaders’ bullets. It does slowly improve at both tasks, and by epoch 180 or so, its play looks very similar to a novice human player, but it doesn’t keep on improving.
The huge difference between the average scores and the maximum scores for each epoch is completely due to luck. It will randomly shoot down a passing mothership, which gives over 300 points. Due to the way the image is cropped before being fed to the agent, it cannot actually see the mothership, so shooting one down is most likely just luck (unless it managed to find the pattern of when they appear and from which direction, something I can’t rule out).
DeepMind’s version did not do that well either, but clearly better than this one.
Highest observed score: 925 (shot two motherships)

Enduro

Enduro is my favourite game to watch it play. Its style is very non-human, combining super-human reflexes in some parts with parts where it just doesn’t know what to do.
It takes ages for it to get any points at all in this game (almost 30 epochs). This is because points are earned by passing cars, and you need to keep the accelerator pressed for quite a few frames in a row to gather enough speed to pass one, which isn’t likely to happen with random joystick movements. Once it learns to do that, though, it quickly gets to very high scores, learning to dodge cars at very high speed in quite different-looking environments.
It never learns what to do in the bright/glare/snow/whatever-it-is stage though, where it just waits it out or goes full tilt till it slams into the car in front of it. It is the only stage where the background is brighter than the objects to avoid, so it is understandable for it to have trouble transferring the learning from the other sections. It also learnt that you cannot gain or lose places while the flags are waving in between levels, so it doesn’t even try while they are being shown.
The high scores here track the average. Enduro takes ages to play so most of the time only one game fits in a testing epoch (10,000 frames).
When running without the 10,000 frames limit, I observed a run scoring 1080. It is part of the video below. It goes for over 7 minutes at 2x normal speed.

Seaquest

I didn’t think the code would learn to play Seaquest because, unlike in all the other games on this page, all 18 actions are available (i.e. up-left, up, up-right, left, neutral, right, down-left, down, down-right, then each of those again with the button pressed), which means that picking the right one is harder. It does learn to play it, though, in a very shallow sense.
It progressively gets better at hunting down sharks and at avoiding crashing into them, all the while ignoring the divers. It gets very good at that, but it never goes up for air, so it keeps dying from lack of oxygen. Eventually, it discovers that going up for air means getting points for rescued divers. This seems to have a negative effect on its agility, though: it starts crashing into everything, and its scores get hammered.
Its best runs are in that short period where it is losing its basic skills but has learnt about surfacing. The highest score I observed was 4000. It is included in the video below.

Q*Bert

It learns to cover the first level very efficiently. The second level is a slightly different colour though, so that trips it up. It manages to learn to get some points on the second level but that seems to have a negative effect on its proficiency on the first level. I left it training for a while to see if it’d make the jump but no cigar.
Highest score I observed was 5600.

Replicating the Atari-playing paper from Deep Mind

Deep Mind is a company that got bought by Google at the beginning of 2014 for the usual gazillion dollars that everything seems to get bought for nowadays. They published a paper not long before that showing how they combined a deep-ish convolutional neural network with Q-learning to get computers to learn to play quite a few Atari games solely from the screen pixels and the score. This is something I’ve tried to do before, so I thought it was very exciting news and that I should try to replicate it.

By the way, if you haven’t seen any of the demo videos showing the results, or haven’t read the paper, go have a look. The paper is “Playing Atari with Deep Reinforcement Learning” by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller. Get it from http://arxiv.org/abs/1312.5602. The rest of the post will make much more sense if you have read through the paper.

A year passed and I still hadn’t attempted any replication. This was quite lucky since Nathan Sprague did, and some of the Theano code he needed to write looks beyond me. His code is here: https://github.com/spragunr/deep_q_rl (which I’ll call the Sprague code).

There are probably a few differences in the details between the implementations by Prof Sprague and Deep Mind. The ones that I noticed are the following:

  • The Sprague code downsamples the 160 x 210 screen to 80 x 105, and then squeezes this into an 80 x 80 input layer, while the Deep Mind version downsamples to 84 x 110 and then crops an 84 x 84 section of it for the input layer (see the sketch after this list). The 80 was most likely chosen by Prof Sprague so that the downsampling would be neat (two to one), but I suspect that the 84 was chosen so that the resultant image would smear a bit, making everything a bit larger.
  • While the Deep Mind implementation mentions that they used Rprop for the weight updating, the Sprague code implements RMSProp with Nesterov momentum. 
  • The discount rate (gamma in Q-learning) is not mentioned in the paper. The Sprague code sets it to 0.9.
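
Here is a rough sketch of the two preprocessing paths as I understand them; the interpolation defaults and the exact crop offset are assumptions rather than code copied from either implementation.

import cv2

def preprocess_sprague(frame):
    # 210x160 RGB screen -> grayscale -> 105x80 -> squeezed into 80x80
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (80, 105))   # cv2 takes (width, height)
    return cv2.resize(small, (80, 80))

def preprocess_deepmind(frame):
    # 210x160 RGB screen -> grayscale -> 110x84 -> crop an 84x84 window
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 110))
    return small[-84:, :]                 # here: keep the bottom 84 rows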

I got some time to download the code and test it a bit over the New Year’s break, and it does brilliantly, even though not quite as brilliantly as what’s described in the paper. Deep Mind’s paper claims an average score of 168 on breakout and 1705 on seaquest, while the Sprague code topped out at about 50 for breakout and didn’t seem to learn on seaquest. I haven’t tried it on any other games yet.

Each 100-episode run takes a bit over 24 hours on my beastly new GTX 970, so there haven’t been too many experiments yet. I did make quite a few changes to the code, but mostly to make it easier to use and other non-learning-related changes.

The few changes that I’ve made that could affect the learning are the following:

  • Changed it to match Deep Mind’s paper more closely by using the crop-to-84×84 method, choosing the bottom part of the image.
  • Made the conversion to grayscale follow whatever it is that cv2 and pillow do to match the way humans perceive luminance, which means paying more attention to the green component than to the blue (see the sketch after this list). The original Sprague code simply averages the red, green and blue components.
  • The number of actions is now restricted to the number of actions provided by the hardware used for that game. For most games that is the joystick, so all 18 actions are used, as is the default (that is, neutral, up, down, left, right, up-left, up-right, down-left, down-right, and each of those with the button pressed). Breakout, pong and some others were played with a paddle though, so only six actions are allowed: neutral, left, right, and each of those with the button pressed. I am not sure whether the original paper did this or not.
  • I’ve played around with the discount rate, currently settling on 0.95.
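
The grayscale change amounts to the difference between these two conversions; the 0.299/0.587/0.114 weights are the standard ITU-R BT.601 luma coefficients, which is what cv2 and pillow use (or something very close to it).

import numpy as np

def gray_average(rgb):
    # Original Sprague code: plain average of the three channels
    return rgb.mean(axis=-1)

def gray_luminance(rgb):
    # Perceptual weighting: more weight on green, less on blue
    return np.dot(rgb, [0.299, 0.587, 0.114])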

As an aside, the change to the input selection showed the sensitivity of the code to the input. I initially used pillow to do the downsampling, instead of the cv2 library as used in the original Sprague code, as pillow is easier to install. This change was enough for the program to stop learning altogether on breakout. Comparing the downsampling produced by pillow vs the downsampling produced by cv2, it seems like cv2 smears the image a bit more, making the ball quite a bit bigger than what happens with the pillow downsampling (note: this is using pillow 2.6.1. The 2.7.0 release seems to change the downsampling algorithm to be a bit more like cv2).

While I still haven’t seen any learning on seaquest, the program got to an average score of slightly over 100 in some of the testing epochs, while regularly averaging over 70. Here is its current masterpiece, a 331 produced after about 18 hours of training:

Here is the more accurate picture of its results, plotting epoch vs score, and the Q-value on a test set, similar to what’s shown on page 5 of the paper:

The score plotted for an epoch is the result of a testing round that is done after that epoch. The testing runs for 10,000 frames; that’s about eight or nine games when it’s doing well. Note the collapse of the score around epoch 103 or so, with the Q-value going haywire shortly after that. That, I think, is the Q-learning diverging. If you look at the paper, their Q-value goes smoothly up. This one never does, not even before the divergence.
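
If it follows the paper, the Q-value being plotted is the average, over a fixed held-out set of states, of the maximum Q-value the network assigns to them; roughly something like this sketch (the q_function and the state set are placeholders).

def average_max_q(q_function, held_out_states):
    # q_function(state) returns the vector of Q-values for that state
    return sum(max(q_function(s)) for s in held_out_states) / float(len(held_out_states))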

My fork is at https://github.com/alito/deep_q_rl. I will probably keep tweaking it. I expect the code to learn nicely on pong, but I’m not sure about other games. Here are some ideas for it:

  • Try converting the cnn_q_learner to use the default Theano convolutional code which uses the NVIDIA cuDNN code. It should be a bit faster than the cuda_convnet code.
  • Try Rprop instead of RMSprop
  • Test adapting discount rates
  • Change the network architecture (bigger, deeper)
  • Change the algorithm to use more time-separated frames as inputs, rather than the last four frames.

Speeding up libsvm

I mentioned in my first post that a run of libsvm’s grid.py tool to optimise the hyperparameters for MNIST took 36 hours on my computer. This was using a manually compiled version of libsvm using the plain source code from the site. There are two things that can massively speed up your libsvm training runs. They are both mentioned on the libsvm site, but they are probably not given enough prominence.

The first one is to parallelise the inner loop of the code using OpenMP. This takes four lines of code. If you use Gentoo, the source code is already patched to do this. The speedup is almost linear with the number of processors in your computer; I have 4 hyperthreaded processors in mine, and I’ve got around a 7.5x speedup. You can read about how to do it in the libsvm FAQ, or just download the patch from Gentoo.

The second speedup is to use CUDA. The CUDA implementation of libsvm was written by (at least one of) Andreas Athanasopoulos, Anastasios Dimou, Vasileios Mezaris and Ioannis Kompatsiaris, and you can find it at http://mklab.iti.gr/project/GPU-LIBSVM. It speeds things up even more than the OpenMP version, but only in certain cases.

For example, training a subset of MNIST using the OpenMP version:


$ time svm-train -q -v 5 -m 1000 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation Accuracy = 96.3833%

real 0m21.397s
user 2m45.332s
sys 0m0.254s

Same thing using the CUDA version:


$ time programming/cuda/libsvm-cuda/svm-train-gpu -q -v 5 -c 64 -g 0.03125 docs/bigdata/mnist/mnist6000.scale mnist.model

Cross Validation = 96.3833%

real 0m10.649s
user 0m9.972s
sys 0m0.654s

That’s a two-times speedup over the 8-processor version. (UPDATE: I realised these numbers don’t mean anything if I don’t tell you at least some specs of my machine: an i7-870 with a GeForce GTS 450 with 512 MB.)

There are a few caveats and things to keep in mind regarding the CUDA version:

  • While the runtime of the CPU version of the code scales linearly with the number of cross-validation folds, the CUDA version’s runtime scales sublinearly, e.g. changing that -v 5 to -v 10 takes 16.5 seconds instead of 10.6.
  • The CUDA code only runs for cross-validation-enabled runs. If I hadn’t used -v 5 in that run, the single-threaded CPU version of the code would have run instead.
  • Most importantly: the CUDA version doesn’t implement SMO when solving the SVM, so its space requirements scale quadratically with the number of samples in your dataset. Since my graphics card has 512 MB of RAM, it can only handle about 7000 samples before it crashes (7000 * 7000 * 8 bytes/double ~= 400 MB; see the arithmetic check after this list). My pick of 6000 for the subset size was a lucky coincidence.
  • The development of the CUDA code seems to have stopped at libsvm 3.0.  I’ve emailed the authors and they replied that they don’t have anyone working on it at the moment but that they are planning to move the code somewhere more accessible so it can be kept up to date by the rest of us.
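
The arithmetic check behind that sample limit, assuming the full kernel matrix is stored as doubles (which is what the quadratic scaling above implies):

samples = 7000
bytes_per_double = 8
print(samples * samples * bytes_per_double / 1e6)   # ~392 MB, close to the card's 512 MB limit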

I have patches for the code to align it with version 3.14 (although I just noticed that libsvm is up to version 3.17), and a Makefile to make it compile on Gentoo.  I pasted the Makefile below since it’s an easy way to get started with the code if you want to try it.


# Change the CUDA_INSTALL_PATH to wherever you have CUDA installed
CUDA_INSTALL_PATH ?= /opt/cuda
NVCC       := $(CUDA_INSTALL_PATH)/bin/nvcc
EXECUTABLE  := svm-train-gpu
CUDACCFLAGS := -po maxrregcount=16
INCLUDES += -I. -I$(CUDA_INSTALL_PATH)/include
LIBS = -lcublas
LD_PATH = -L$(CUDA_INSTALL_PATH)/lib

CXXFLAGS ?= $(CFLAGS)
CXXFLAGS += -fPIC -W -Wall -Wswitch -Wformat -Wchar-subscripts -Wparentheses -Wmultichar -Wtrigraphs -Wpointer-arith -Wcast-align -Wreturn-type -Wno-unused-function -m32 -DUNIX


# Debug/release configuration

ifeq ($(dbg),1)
    CXXFLAGS += -g -D_DEBUG
else
    CXXFLAGS += -O2 -fno-strict-aliasing
endif

all: $(EXECUTABLE)

$(EXECUTABLE): svm.o svm-train.o
	$(CXX) $(CXXFLAGS) -o $@ $^ $(LIBS) $(LD_PATH)

svm.o: svm.cpp svm.h

svm-train.o: svm.h svm-train.c kernel_matrix_calculation.c cross_validation_with_matrix_precomputation.c
	$(CXX) $(CXXFLAGS) $(INCLUDES) -c -o $@ svm-train.c

clean:
	rm svm.o svm-train.o svm-train-gpu

Let’s start with MNIST

So, lots of Coursera subjects done and then crap all produced. Geoffrey Hinton’s Neural Networks for Machine Learning was very interesting. I’ll do something with that.

Interesting paper number one: “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion” by Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio and Pierre-Antoine Manzagol. Stacked = stacked. Denoising = get rid of noise. Autoencoders = things (single-hidden-layer neural networks in this case) that encode themselves (not quite true: they encode the input, and they try to make their output look like their input). They trained these stacked denoising autoencoders (SDAs) on a few standard datasets and used the hidden-layer representations as inputs to support vector machines (SVMs) for classification. They get very good results.

I want to replicate these results to see how much fidgeting would need to be done to get such good results and to try to gauge how easy it would be to use the same technique for other problems. I can also document the parameters that weren’t specified in the paper.

The easiest of their datasets for me to start with is MNIST, an (over)used collection of 70,000 images of hand-written single digits (0 to 9) created by Corinna Cortes and Yann LeCun. You can get it from http://yann.lecun.com/exdb/mnist/, or, preprocessed so that all pixel intensity values are scaled to the range 0 to 1 and stored in a more semi-standard format, from the libsvm dataset page (which is what I did).
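
If you grab the libsvm-format version, libsvm's own Python bindings can read it directly; this assumes libsvm/python is on your PYTHONPATH, and the file name matches the commands further down.

from svmutil import svm_read_problem

# labels is a list of digit classes, features a list of {index: value} dicts
labels, features = svm_read_problem('mnist.scale')
print(len(labels))   # 60000 training samples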

The images are all grayscale 28×28 images (processed from 20×20 black and white images) with the number centered in the image and spanning most of it. The numbers were collected from real people (government employees and high-school students), and they look like normal hand-written digits in all the usual fucked up ways that people write numbers.  eg here’s a sample my computer picked at random:

The problem is known to be easy. The paper uses SVMs with Gaussian radial basis function (RBF) kernels as the baseline to compare against, and they get 98.6% right. SVMs = black magic classifiers. How I understand them: they work by mapping all the data to higher dimensions and then drawing a hyperplane (i.e. a line/plane in n-1 dimensions) between the samples to divide them as well as possible (which is pretty much how I think of most classifiers). The kernels they use are functions of the dot product of the samples. I think of them as a way to measure the distances between samples, and I thought they could also be interpreted as how the mapping to higher dimensions is done, but I read Someone Smart saying this wasn’t the case, so it most likely isn’t (I can’t remember who or where that was, but it seemed authoritative enough to stick).

The most popular SVM package seems to be libsvm by Chih-Chung Chang and Chih-Jen Lin. It has nice command-line tools, python bindings and nice utils (also written in python) so I was sold.

There are two important hyperparameters to tweak for SVMs that use Gaussian RBFs. The first is the cost (C) of misclassification (i.e. where to set the trade-off between keeping the dividing hyperplane simple and dividing the training dataset exactly). The second is gamma (g), which controls how tight we want the Gaussian RBFs to be (g is inversely related to the square of the width of those Gaussians).
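
For reference, this is the kernel those two hyperparameters control; the gamma default below is the value the grid search ends up picking further down, and the function itself is just a sketch rather than anything libsvm exposes.

import numpy as np

def rbf_kernel(x, y, gamma=0.03125):
    # K(x, y) = exp(-gamma * ||x - y||^2); larger gamma means tighter Gaussians
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))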

SVMs only classify between two classes, and there are a lot of different ways to extend that to multiclass classification (like picking a digit from 0 to 9). One way is to build one classifier per target class (10 in this case). When you train each classifier, you put the target class samples on one side, and all other samples on the other side (e.g. when training the classifier for the digit 3, you call all the samples of 3s true, and all the samples of non-3s (0s, 1s, 2s, 4s, 5s, 6s, 7s, 8s and 9s) false). When you are testing, you see which of the classifiers says it is most likely that its target class is true. This is called one-vs-all classification.

Another “simple” way to do multiclass classification, and the way that libsvm does it by default, is to build a classifier for each pair of target classes (e.g. one for the 1s vs the 2s, one for the 7s vs the 8s, one for the 1s vs the 7s, etc): n * (n-1) / 2 classifiers in total (45 in this case). Then when you test a sample you run it through all n * (n-1) / 2 classifiers, keep a tally of which class wins most of the contests, and claim the winner as the class of the sample. This is one of the one-vs-one classification systems. Intuitively, the one-vs-all system seems better to me, but libsvm and others seem to like one-vs-one and claim it works better. There are many other ways of mapping binary classification to multiclass classification, and I’ll probably try some in the future.
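
A toy sketch of the one-vs-one voting just described; the classifiers dict and its predict() method are made-up placeholders for illustration, not libsvm's API.

from collections import Counter
from itertools import combinations

def predict_one_vs_one(classifiers, sample, classes=range(10)):
    # classifiers[(a, b)] is a binary classifier that returns either a or b
    votes = Counter()
    for a, b in combinations(classes, 2):   # 45 pairs for 10 classes
        votes[classifiers[(a, b)].predict(sample)] += 1
    return votes.most_common(1)[0][0]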

libsvm comes with a little util called grid.py that looks for the best hyperparameters by sampling the two of them on a log/log grid, training and cross-validating on the data you give it. The MNIST data is split into 60,000 training samples and 10,000 test samples. I picked 6000 out of the 60,000 training samples to do the grid search on, to speed the process up. Running grid.py with default parameters samples 110 points on the grid and took about 36 hours on my i7-870 on those 6000 samples (this was before I discovered libsvm-cuda, but I’ll write about that later).

While grid.py was running I noticed that the gamma parameter was very important, while the cost was completely uninteresting as long as it was sufficiently high (anything bigger than 2). grid.py came back with log2(gamma) = -5 (i.e. 0.03125) and some value for the cost, but I picked 64. I probably should have done some more fine-grained searching on the gamma parameter, but I thought 36 hours was enough runtime for my computer.

Training with those parameters on the whole training set and testing on the test set gives me 98.57% right. Which is ridiculous. Practically no preprocessing (only scaling between 0 and 1), no coding, automated tweaking, and it gets almost all the digits right. SVMs are something I’m going to have to keep trying on other datasets.

No coding is required for the above. Assuming you have libsvm compiled and you have, say, the libsvm source directory under “libsvm”, the full list of commands to get the above results would be:

# subselect 6000 training samples to run the grid faster
libsvm/tools/subset.py mnist.scale 6000 mnist6000.scale 
# grid to find the best hyperparameters
libsvm/tools/grid.py mnist6000.scale
# wait for a long time and read output from grid.py giving you the best hyperparameters
# use them to train a model
svm-train -c 64 -g 0.03125 mnist.scale mnist.model
# see how well that did
svm-predict mnist.scale.t mnist.model mnist.outputguesses

By the way, here are the 143 cases that it got wrong:

That’s the straight SVM result replicated. Next up: replicating the interesting bit, the autoencoder preprocessing.