supervised learning of policy networks (SL)

* 13-layer

* 30 million positions from KGS Go Server

* from 160,000 games played by KGS 6 to 9 dan human players

* pass moves were excluded from the dataset.

* augmented dataset to include all 8 reflections and rotations of each position
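
The 8-fold augmentation (4 rotations × optional reflection, the dihedral group of the square board) can be sketched in plain numpy; `dihedral_augment` is an illustrative name, not from the paper:

```python
import numpy as np

def dihedral_augment(board):
    """Return all 8 rotations/reflections of a 2-D board plane.

    Sketch of the dataset augmentation above: 4 rotations of the
    19x19 position, each with and without a left-right flip.
    """
    out = []
    for k in range(4):
        rotated = np.rot90(board, k)
        out.append(rotated)
        out.append(np.fliplr(rotated))
    return out
```

Applied to every feature plane of a position, this multiplies the 30 million KGS positions into 8 symmetric variants each.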

* for each training step, sampled a randomly selected mini-batch of m samples from the augmented KGS dataset

* applied an asynchronous SGD update to maximize the log likelihood of the action

* step size alpha was initialized to 0.003 and halved every 80 million training steps; no momentum term was used
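
The step-size schedule is a simple step decay; a one-liner makes the halving explicit (`step_size` is an illustrative name):

```python
def step_size(step, alpha0=0.003, halve_every=80_000_000):
    """Step size after `step` training steps: alpha0 halved once per
    completed block of `halve_every` steps (no momentum involved)."""
    return alpha0 * 0.5 ** (step // halve_every)
```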

* mini-batch size m=16

* updates were applied asynchronously on 50 GPUs using DistBelief

* gradients older than 100 steps were discarded

* held-out test set accuracy: 57%

* larger networks achieve better accuracy but are slower to evaluate during search

* training took around 3 weeks for 340 million training steps.
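
The SL objective, maximizing the log likelihood of the human move, can be sketched for a toy linear softmax policy (standing in for the 13-layer CNN; `sl_update` and the shapes are illustrative):

```python
import numpy as np

def sl_update(W, x, a, alpha):
    """One SGD step increasing log p(a | x) for a linear softmax policy.

    W maps a flattened feature vector x to move logits; a is the index
    of the human expert's move. Gradient of log-softmax at action a is
    (onehot(a) - p) x^T, added with step size alpha.
    """
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = -p[:, None] * x[None, :]
    grad[a] += x
    return W + alpha * grad
```

One such update per sampled mini-batch position, applied asynchronously across workers, is the training loop described above.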

reinforcement learning of policy networks (RL)

* use policy gradient

* RL policy network won more than 80% of games against SL policy network.

* tested against the strongest open-source Go program, Pachi, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move.

* using no search at all, RL policy network won 85% of games against Pachi.

* each iteration consisted of a mini-batch of n games played in parallel

* between the current policy network p_ρ being trained and an opponent p_ρ′ that uses parameters from a previous iteration, randomly sampled from a pool of opponents, so as to increase the stability of training

* every 500 iterations, added the current parameters to the opponent pool
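
The opponent-pool scheme above reduces to a few lines; `maybe_update_pool` and the parameter objects are illustrative names:

```python
import random

def maybe_update_pool(pool, current_params, iteration, every=500):
    """Add the current parameters to the opponent pool every `every`
    iterations, then sample an opponent uniformly from the pool."""
    if iteration % every == 0:
        pool.append(current_params)
    return random.choice(pool)
```

Sampling from past snapshots instead of always playing the latest network is what stabilizes self-play training.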

* each game i in the mini-batch was played out until termination at step T^i

* each game i then was scored to determine the outcome z from each player’s perspective

* the games were then replayed to determine the policy gradient update, using the REINFORCE algorithm

* the policy network was trained in this way for 10,000 mini-batches of 128 games, using 50 GPUs, for one day
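
The REINFORCE update from the replay step can be sketched as follows: accumulate the gradient of log π for each move of game i and scale by the final outcome z (+1 win, −1 loss). Names and shapes are illustrative, not from the paper:

```python
import numpy as np

def reinforce_update(theta, grads_logp, z, alpha):
    """REINFORCE: theta += alpha * z * grad log pi(a_t|s_t), summed
    over the moves of one replayed game with outcome z in {+1, -1}."""
    for g in grads_logp:
        theta = theta + alpha * z * g
    return theta
```

Won games reinforce every move played; lost games push the same moves' probabilities down.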

reinforcement learning of value networks

* trained a value network to approximate the value function of the RL policy network p_ρ.

* purpose is for position evaluation

* has the same architecture as the policy network

* but outputs a single prediction instead of a probability distribution

* train weights of the value network by regression on state-outcome pairs (s, z)

* use SGD to minimize MSE

* to avoid overfitting to the strongly correlated positions within games

* built a new dataset of uncorrelated self play positions

* has over 30 million positions, each drawn from a unique game of self-play

* this dataset provides unbiased samples of the value function

* training method was identical to SL policy network training

* except that the parameter update was based on mean squared error between predicted values and observed rewards

* network was trained for 50 million mini-batches of 30 positions, using 50 GPUs, for one week.
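
The regression step can be sketched with a toy tanh-output value model in place of the CNN (`value_update` and the linear parameterization are illustrative):

```python
import numpy as np

def value_update(w, s, z, alpha):
    """One SGD step on the MSE (v(s) - z)^2 for a linear model with a
    tanh output unit, standing in for the value network.

    Chain rule: d/dw (v - z)^2 = 2 (v - z) (1 - v^2) s, with
    v = tanh(w . s); the update moves v(s) toward the outcome z.
    """
    v = np.tanh(w @ s)
    return w - alpha * 2 * (v - z) * (1 - v**2) * s
```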

Features for policy/value network

* each position s was pre-processed into a set of 19×19 feature planes.

* status of each intersection of Go board

* stone color, liberties, captures, legality, turns since stone was played, current color to play

* each integer feature value is split into multiple 19×19 planes of binary values (one-hot encoding)
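
The one-hot split can be sketched for a single integer feature such as liberty counts; `to_planes` is an illustrative name, and the convention that the last plane captures values ≥ n_planes is an assumption about the encoding:

```python
import numpy as np

def to_planes(feature, n_planes):
    """Split an integer-valued 19x19 feature into n_planes binary
    19x19 planes: plane k fires where feature == k + 1, and the last
    plane fires where feature >= n_planes (assumed capping rule)."""
    planes = [feature == k + 1 for k in range(n_planes - 1)]
    planes.append(feature >= n_planes)
    return np.stack(planes).astype(np.uint8)
```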

NN architecture

* for policy network

* input to the policy network is a 19×19×48 image stack consisting of 48 feature planes

* first hidden layer zero pads the input into a 23×23 image

* then convolves k filters of kernel size 5×5 with stride 1 with the input image, followed by rectifier nonlinearity.

* each subsequent hidden layer, 2 to 12, zero pads the respective previous hidden layer into a 21×21 image

* then convolves k filters of kernel size 3×3 with stride 1, followed by rectifier nonlinearity.

* final layer convolves 1 filter of kernel size 1×1 with stride 1, applies a softmax function

* the match version of AlphaGo used k=192 filters

* for value network

* input is also a 19x19x48 image stack

* with additional binary feature plane describing the current color to play

* hidden layers 2 to 11 are identical to the policy network

* but hidden layer 12 is an additional conv layer

* hidden layer 13 convolves 1 filter of kernel size 1×1 with stride 1

* hidden layer 14 is a fully connected linear layer with 256 rectifier units

* output layer is a fully connected linear layer with a single tanh unit
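
The padding arithmetic above can be checked with a one-line helper (illustrative, assuming stride-1 "valid" convolution after the zero-padding): padding 19 up to 23 before a 5×5 kernel, or up to 21 before a 3×3 kernel, preserves the 19×19 board size at every layer.

```python
def conv_out(padded_size, kernel):
    """Output width of a stride-1 convolution on an input already
    zero-padded to `padded_size`: (padded_size - kernel) / 1 + 1."""
    return padded_size - kernel + 1
```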

# READING NOTES: ALPHAGO (PART 2)
