[H-GEN] How Alpha Go Zero works

Russell Stuart russell-humbug at stuart.id.au
Mon Nov 6 12:38:46 AEST 2017


You can read the details here:

http://tim.hibal.org/blog/alpha-zero-how-and-why-it-works/

In brief (well, it turns out this is not so brief): they used a
technique pioneered by DeepMind, the same outfit whose AI solved the
Atari Breakout game.  (Google purchased DeepMind Technologies Limited
in 2014.)  Solving games like this boils down to exploring the tree of
all possible moves at a given stage of play.  It's difficult because
the tree is so huge that exploring it all is impossible.
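
To get a feel for the numbers (my own back-of-envelope figures, not
from the article): Go has roughly 250 legal moves per position and a
game runs to roughly 150 moves, so:

    # Rough size of the full Go game tree (assumed, approximate figures).
    branching = 250   # legal moves per position, give or take
    depth = 150       # moves in a typical game
    positions = branching ** depth
    print(f"about 10^{len(str(positions)) - 1} positions")  # about 10^359

For comparison, there are thought to be around 10^80 atoms in the
observable universe.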

Since that's impossible, they use a Monte Carlo algorithm.  Monte Carlo
is a fancy way of saying: choose a move at random and see whether it
"improves things".  But of course telling whether something has
improved after just one move is also near impossible, so they have to
go down several moves, randomly choosing each next move.  How you
define "improves things" depends on the game - in checkers it might be
taking a piece, or crowning a king.

How many moves it takes for something like this to happen also depends
on the game, and perhaps on the stage of play - for example, in the
opening stages of chess it might take around 5 moves for something
interesting to happen.  So they randomly make 5 moves, then assign that
branch +1 if it looks better for the AI, -1 if it looks better for the
opponent, and 0 if they don't know.  They evaluate this score for every
possible move on the board.
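
A minimal sketch of that idea (the position object and its methods are
hypothetical stand-ins, just enough to show the shape of a rollout):

    import random

    def rollout_score(position, depth=5):
        # Play `depth` random moves, then judge the resulting position:
        # +1 if it looks better for the AI, -1 if it looks better for
        # the opponent, 0 if we can't tell.  `legal_moves`, `play` and
        # `looks_better_for_ai` are assumed here, not a real API.
        for _ in range(depth):
            moves = position.legal_moves()
            if not moves:                   # game ended mid-rollout
                break
            position = position.play(random.choice(moves))
        return position.looks_better_for_ai()   # +1, 0 or -1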

Obviously these -1, 0, +1 scores per move aren't that useful (in fact
they are bloody horrible), so now they have to be improved.  This is
where the sheer speed of a computer comes in, because unlike a human a
computer can evaluate this score for every possible move on the board
thousands if not millions of times in the duration of a normal human
turn.  So they choose the most promising move based on its -1, 0, +1
score, score every possible move using the new position as the
starting position (ie, repeat the process), then adjust the original
-1, 0, +1 score depending on the outcome.  There is a formula
(described in the link; I presume it's based on information theory, so
it's optimal in some fashion) they use to combine the original -1, 0,
+1 score with the new information to yield a new score for the branch.
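
In code terms the bookkeeping per branch is tiny - a running total of
outcomes plus a visit count (my own sketch of the idea):

    class Branch:
        # Statistics for one candidate move, updated every time a
        # deeper exploration starting below it finishes with an
        # outcome of +1, 0 or -1.
        def __init__(self):
            self.total = 0    # sum of outcomes seen so far
            self.visits = 0   # how many times we've explored it

        def record(self, outcome):
            self.total += outcome
            self.visits += 1

        def average(self):
            return self.total / self.visits if self.visits else 0.0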

The actual formula is drop-dead simple, but not that important - what
is important is that it takes 2 parameters, and so far I've only
mentioned one: the scores achieved.  The second parameter is how many
times each branch has been explored.  They favour under-explored
branches.
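
For the record, I believe the formula the linked article describes is
the standard UCB1 rule from the multi-armed bandit literature; a
minimal version looks like this:

    import math

    def ucb1(total, visits, parent_visits, c=1.4):
        # Balance the score achieved (first parameter) against how
        # little the branch has been explored (second parameter).  An
        # unvisited branch gets infinite priority so everything gets
        # tried at least once; c is a tunable exploration constant.
        if visits == 0:
            return float("inf")
        return total / visits + c * math.sqrt(math.log(parent_visits) / visits)

The square-root term grows when a branch is neglected relative to its
siblings, which is exactly the favouring of under-explored branches
mentioned above.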

So that's how a basic Monte Carlo game player works, and it is how the
chess / checkers / poker programs that defeated humans work.  But
notice there is no computer-driven learning happening here.  There is
learning going on - but it was done by the computer programmers: fine
tuning how they scored positions, making the random moves smarter, fine
tuning the optimal depth, figuring out how much to favour unexplored
regions - and by hardware engineers designing custom circuits to
evaluate positions.  (The secret sauce that makes one computer better
than another at playing a game is in these tweaks.  In fact without
them the raw Monte Carlo algorithm performs appallingly badly.)

It is these human programmers the AI in Alpha Go replaced.  In fact
figuring out how to replace them was DeepMind's contribution to the
art.  The contribution was just to weight (literally: multiply) the
score of each branch by the output of an AI given the current state
of the board.
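
To sketch that weighting: the selection rule published in the AlphaGo
papers, called PUCT, folds the network's output into the exploration
bonus rather than multiplying the whole score, but the flavour is the
same:

    import math

    def puct(total, visits, parent_visits, prior, c=1.0):
        # `prior` is the neural network's probability for this move
        # given the current board.  It scales the exploration bonus,
        # so the search spends its time on moves the network likes.
        exploit = total / visits if visits else 0.0
        return exploit + c * prior * math.sqrt(parent_visits) / (1 + visits)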

The difference between Alpha Go and Alpha Go Zero is how the AI is
trained.  Neural networks are trained by giving them a sample to
evaluate, looking at the output, and then tweaking them slightly to
nudge them towards the desired result.  It's a horribly inefficient
process requiring an enormous number of samples fed to it multiple
times, and consequently a prodigious amount of CPU time.
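
A toy illustration of the nudging - a single weight trained by
ordinary gradient descent, my own example, nothing AlphaGo-specific:

    def nudge(weight, sample, target, rate=0.01):
        # One training step: evaluate, compare with the desired
        # result, tweak slightly in the direction that reduces the
        # error.  A real network has millions of weights and uses
        # backpropagation to pick each tweak's direction.
        output = weight * sample
        error = output - target
        return weight - rate * error * sample

    w = 0.0
    for _ in range(1000):          # the same sample, fed many times
        w = nudge(w, sample=2.0, target=6.0)
    print(round(w, 3))             # has crept towards 3.0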

Alpha Go was given games played by experts: it was shown a board, and
the next move played by the human expert was considered the best one.
If it didn't pick that move, it was given a nudge.  This suffers from a
couple of problems: firstly there is a limited amount of training data,
and secondly it was never going to get much better than those human
experts.  Alpha Go Zero trained against itself.  When confronted with a
set of similar-looking positions it selected one at random in the usual
Monte Carlo way, which eventually led to one side winning, and the
neural network was adjusted accordingly.  That neatly solves both
problems.
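
At cartoon level the whole loop looks like this (self_play and adjust
are hypothetical stand-ins for a great deal of machinery):

    def train(network, games):
        # One iteration: the network plays a full game against itself
        # using the Monte Carlo search above, then gets nudged towards
        # the moves the winning side chose.
        for _ in range(games):
            moves, winner = self_play(network)
            adjust(network, moves, winner)
        return network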
 
The speed at which it can learn is limited only by the amount of
computing power you can throw at it.  The end result was that it took
Alpha Go Zero 3 days to go from knowing nothing about the game to
beating the best human on the planet, compared to the decades it takes
a human to achieve the same thing.

