[H-GEN] How Alpha Go Zero works

Peter Robinson pjr at itee.uq.edu.au
Mon Nov 6 13:11:05 AEST 2017


Good summary, Russell.

As a former Go player I am very interested in all this. I have been 
watching reviews of the games against Lee Sedol, Ke Jie, the 60 games 
as Master and the recent self-play games of AG. The reviews are by 
Michael Redmond (9 dan pro). For those interested, check out the 
American Go Association YouTube channel. Haylee's channel is also good 
(she was involved in setting up the matches against Lee Sedol).

Some of his, and other professionals', observations:

AG often loses points in the end game as it chooses between several 
winning moves and might pick a move that only wins by a small margin 
- it doesn't care. The professionals don't really like this as they 
strive for "perfect end games".

When AG realizes it is losing it often plays weird moves (as it can't 
find a winning move).

AG is very good at finding multi-purpose moves in the center of the 
board in the middle game.

AG is very good at using bad aji (i.e. stones it has sacrificed but 
which still have potential to cause trouble for the opponent later).

It's interesting how AG deals with groups that are not fully alive. 
Human players invest quite a lot of emotional energy in either saving 
their own groups or killing their opponent's groups. AG doesn't care - 
it is always looking for the best move (without emotion) and so in some 
games it will sacrifice a group that might give the opponent 40 or so 
points, only to get 50 points by killing an opponent's group that was 
close to being alive (or simply by expanding some territory).

It appears that the Korean professionals think AG Zero plays more 
human-like moves than earlier versions.

AG has invented some new josekis (standard sequences of moves) that 
professionals are exploring in their own games. It has also decided 
that some well-established human josekis aren't that great after all.


Peter


On 06/11/17 12:38, Russell Stuart wrote:
>
> You can read the details here:
>
> http://tim.hibal.org/blog/alpha-zero-how-and-why-it-works/
>
> In brief (well, it turns out this is not so brief): they used a
> technique pioneered by DeepMind, the same AI outfit whose system
> solved the Atari Breakout game.  (Google purchased DeepMind
> Technologies Limited in 2014.)  Solving games like this boils down to
> exploring the tree of all possible moves at a given stage of play.
> It's difficult because the tree is so huge that exploring it all is
> impossible.
>
> Since it's impossible, they use a Monte Carlo algorithm.  Monte Carlo
> is a fancy way of saying choose a move at random and see whether it
> "improves things".  But of course telling whether something has
> improved after just one move is also near impossible, so they have to
> go down several moves - randomly choosing each next move.  How you
> define "improves things" depends on the game - in checkers it might be
> taking a piece, or crowning a king.
>
> How many moves it takes for something like this to happen also depends
> on the game and perhaps on the stage of play - so for example in the
> opening stages of chess it might take around 5 moves for something
> interesting to happen.  So they randomly make 5 moves, then assign that
> branch +1 if it looks better for the AI, -1 if it looks better for the
> opponent, and 0 if they don't know.  They evaluate this score for
> every possible move on the board.
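>
> A rough sketch of that scoring pass in Python (nothing here is from
> the article - the "game" is a toy where the board is just a number and
> a move adds to it, purely to show the shape of the idea):
>
>     import random
>
>     def legal_moves(state):
>         return [-2, -1, 0, 1, 2]       # pretend these are the legal moves
>
>     def play(state, move):
>         return state + move            # apply a move, get the new "position"
>
>     def looks_better_for_ai(state):
>         # crude judgement of a position: +1 good for the AI, -1 bad, 0 unknown
>         return (state > 0) - (state < 0)
>
>     def score_move(state, move, depth=5):
>         """Play `move`, then `depth` random moves, then judge the result."""
>         s = play(state, move)
>         for _ in range(depth):
>             s = play(s, random.choice(legal_moves(s)))
>         return looks_better_for_ai(s)
>
>     # evaluate the -1 / 0 / +1 score for every possible move
>     print({m: score_move(0, m) for m in legal_moves(0)})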
>
> Obviously these -1, 0, +1 scores per move aren't that useful (in fact
> they are bloody horrible), so now they have to be improved.  This is
> where the sheer speed of a computer comes in, because unlike a human a
> computer can evaluate this score for every possible move on the board
> thousands if not millions of times in the duration of a normal human
> turn.  So they choose the most promising move based on the -1, 0, +1
> scores, score every possible move using the new position as the
> starting position (i.e. repeat the process), then adjust the original
> -1, 0, +1 score depending on the outcome.  There is a formula
> (described in the link; I presume it's based on information theory so
> it's optimal in some fashion) that they use to combine the original
> -1, 0, +1 with the new information to yield a new score for the branch.
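>
> The link has the exact formula; a standard way of folding each new
> result into a branch's score (and the one most Monte Carlo tree search
> write-ups use) is simply a running average of the outcomes seen under
> that branch.  A sketch of mine, not lifted from the article:
>
>     class Branch:
>         def __init__(self):
>             self.visits = 0     # how many times this branch has been explored
>             self.total = 0.0    # sum of the +1 / 0 / -1 outcomes seen so far
>
>         def update(self, outcome):
>             """Fold one new +1/0/-1 result into the branch."""
>             self.visits += 1
>             self.total += outcome
>
>         @property
>         def score(self):
>             # average outcome so far; 0 if the branch has never been tried
>             return self.total / self.visits if self.visits else 0.0
>
>     b = Branch()
>     for outcome in (+1, -1, +1, 0, +1):
>         b.update(outcome)
>     print(b.visits, b.score)    # 5 visits, average 0.4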
>
> The actual formula is drop dead simple, but not that important - what
> is important is that it takes 2 parameters, and so far I've only
> mentioned one: the scores achieved.  The second parameter is how many
> times each branch has been explored.  They favour under-explored
> branches.
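>
> For the curious, the textbook version of that formula is called UCB1:
> take the branch's average score and add a bonus that grows the less
> the branch has been visited.  A sketch (the standard formula, not
> necessarily the exact variant the blog post describes):
>
>     import math
>
>     def ucb1(avg_score, branch_visits, parent_visits, c=1.4):
>         """Selection score: exploit good branches, but boost neglected ones."""
>         if branch_visits == 0:
>             return float("inf")        # untried branches get looked at first
>         bonus = c * math.sqrt(math.log(parent_visits) / branch_visits)
>         return avg_score + bonus
>
>     # from a node visited 100 times: a well-explored strong branch versus
>     # a barely-explored weaker one - the neglected branch can still win
>     print(ucb1(0.60, branch_visits=80, parent_visits=100))
>     print(ucb1(0.40, branch_visits=5,  parent_visits=100))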
>
> So that's how a basic Monte Carlo game player works, and this is how
> the chess / checkers / poker programs that defeated humans work.  But
> notice there is no computer-driven learning happening here.  There is
> learning going on - but it was done by the computer programmers: fine
> tuning how they scored positions, making the random moves smarter, fine
> tuning the optimal depth, figuring out how much they favoured
> unexplored regions, and by hardware engineers designing custom circuits
> to evaluate positions.  (The secret sauce that makes one computer
> better than another at playing a game is in these tweaks.  In fact
> without them the raw Monte Carlo algorithm performs appallingly badly.)
>
> It is these human programmers the AI in Alpha Go replaced.  In fact
> figuring out how to replace them was DeepMind's contribution to the
> art.  The contribution was just to weight (literally: multiply) the
> score of each branch with the output of an AI given the current state
> of the board.
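>
> In code terms the change is small: the network looks at the position
> and hands back a "prior" for each move, and that prior scales the
> exploration bonus, so moves the network likes get searched more.
> Roughly (my sketch of the published PUCT-style rule, not DeepMind's
> actual code):
>
>     import math
>
>     def puct(avg_score, prior, branch_visits, parent_visits, c=1.5):
>         """Like UCB1, but the network's opinion of the move scales the bonus."""
>         bonus = c * prior * math.sqrt(parent_visits) / (1 + branch_visits)
>         return avg_score + bonus
>
>     # the network thinks move A is promising (prior 0.5), move B is not (0.05)
>     print(puct(0.10, prior=0.50, branch_visits=3, parent_visits=100))
>     print(puct(0.10, prior=0.05, branch_visits=3, parent_visits=100))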
>
> The difference between Alpha Go and Alpha Go Zero is how the AI is
> trained.  Neural networks are trained by giving them a sample to
> evaluate, looking at the output and then tweaking them slightly to
> nudge them towards the desired result.  It's a horribly inefficient
> process requiring an enormous number of samples fed to it multiple
> times, and consequently a prodigious amount of CPU time.
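>
> The "nudge" is a gradient descent step.  A toy with a single weight,
> just to show the shape of it (a real network has millions of weights,
> but each one gets the same treatment):
>
>     def nudge(weight, sample, desired, rate=0.01):
>         """One training step: move the weight a little toward the desired output."""
>         predicted = weight * sample            # the whole "network" is just w * x
>         error = predicted - desired
>         return weight - rate * error * sample  # gradient step on the squared error
>
>     w = 0.0
>     for _ in range(1000):                      # the same sample, shown many times
>         w = nudge(w, sample=2.0, desired=6.0)
>     print(round(w, 3))                         # creeps up towards 3.0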
>
> Alpha Go was given games played by experts, so it was shown a board and
> the next move played by the human expert was considered the best one.
> If it didn't pick that move, it was given a nudge.  This suffers from a
> couple of problems: firstly there is a limited amount of training data,
> and secondly it was never going to get much better than those human
> experts.  Alpha Go Zero trained against itself.  When confronted with a
> set of similar-looking positions it selected one at random in the usual
> Monte Carlo way, which eventually led to one side winning, and the
> neural network was adjusted accordingly.  That neatly solves both
> problems.
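>
> Schematically the self-play loop is: play a full game in which both
> sides pick moves the Monte Carlo way, note who won, and nudge the
> network towards predicting that result from every position that came
> up.  In sketch form (every helper here is a stand-in of mine, not the
> real AlphaGo Zero code):
>
>     import random
>
>     def choose_move(state):
>         # stand-in for "pick a move via the tree search described above"
>         return state + random.choice([-1, 1])
>
>     def self_play_game():
>         """Both sides use the same chooser; return positions seen and the winner."""
>         positions, state = [], 0
>         for _ in range(20):                    # toy game: 20 moves, then score it
>             state = choose_move(state)
>             positions.append(state)
>         return positions, (+1 if state > 0 else -1)
>
>     def adjust_network(positions, winner):
>         # stand-in for the nudge step: push the network towards predicting
>         # `winner` from each of these positions
>         pass
>
>     for game in range(100):                    # no human games needed anywhere
>         positions, winner = self_play_game()
>         adjust_network(positions, winner)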
>   
> The speed it can learn at is only limited by the amount of computing
> power you can throw at it.  The end result was that it took Alpha Go
> Zero 3 days to go from knowing nothing about the game to beating the
> best human on the planet, compared to the decades it takes a human to
> achieve the same thing.
> _______________________________________________
> General mailing list
> General at lists.humbug.org.au
> http://lists.humbug.org.au/mailman/listinfo/general
>


