supervised learning of policy networks (SL)
* 30 million positions from KGS Go Server
* from 160,000 games played by KGS 6 to 9 dan human players
* pass moves were excluded from the data set
* augmented dataset to include all 8 reflections and rotations of each position
* for each training step, sampled a random mini-batch of m positions from the augmented KGS dataset
* applied an asynchronous SGD update to maximize the log likelihood of the action
* step size α was initialized to 0.003 and halved every 80 million training steps; no momentum term was used
* mini-batch size m=16
* updates were applied asynchronously on 50 GPUs using DistBelief
* gradients older than 100 steps were discarded
* achieved 57% accuracy on a held-out test set
* larger networks achieve better accuracy but are slower to evaluate during search
* training took around 3 weeks for 340 million training steps.
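The SGD schedule above can be sketched as follows. This is a minimal illustration of the update rule described in the notes (initial step size 0.003, halved every 80 million steps, no momentum); the function names are mine, not from the paper.

```python
import numpy as np

def lr_schedule(step, alpha0=0.003, halve_every=80_000_000):
    """Step size alpha, halved every 80 million training steps."""
    return alpha0 * 0.5 ** (step // halve_every)

def sl_update(theta, grad_log_prob, step):
    """One SGD ascent step on the log likelihood of the human move
    (no momentum term, per the notes)."""
    return theta + lr_schedule(step) * grad_log_prob
```

In the real system these updates were applied asynchronously across 50 GPUs, with gradients older than 100 steps discarded; the sketch shows only the synchronous core of the rule.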
reinforcement learning of policy networks (RL)
* use policy gradient
* RL policy network won more than 80% of games against the SL policy network.
* tested against the strongest open-source Go program, Pachi, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move.
* using no search at all, RL policy network won 85% of games against Pachi.
* each iteration consisted of a mini-batch of n games played in parallel
* between the current policy network pρ being trained and an opponent pρ′ that uses parameters from a previous iteration, randomly sampled from a pool of opponents, so as to increase the stability of training
* every 500 iterations, added the current parameters to the opponent pool
* each game i in the mini-batch was played out until termination at step T^i
* each game i then was scored to determine the outcome z from each player’s perspective
* the games were then replayed to determine the policy gradient update, using the REINFORCE algorithm
* the policy network was trained in this way for 10,000 mini-batches of 128 games, using 50 GPUs, for one day
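The REINFORCE step for a single game can be sketched as below: the per-step gradients of log π(a_t|s_t) are summed over the game and scaled by the outcome z (+1 for a win, −1 for a loss, from the current player's perspective). The function and parameter names are illustrative, not from the paper's code.

```python
import numpy as np

def reinforce_update(theta, per_step_grads, z, lr=0.003):
    """One REINFORCE ascent step for a single finished game:
    theta <- theta + lr * z * sum_t grad log pi(a_t | s_t).
    per_step_grads has shape (T, dim_theta); z is the game outcome (+1/-1)."""
    return theta + lr * z * np.sum(per_step_grads, axis=0)
```

In the actual training loop, a mini-batch of 128 games was played in parallel and the updates from all games in the batch were combined.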
reinforcement learning of value networks
* trained a value network to approximate the value function of the RL policy network pρ.
* purpose is for position evaluation
* has the same architecture as the policy network
* but outputs a single prediction instead of a probability distribution
* train weights of the value network by regression on state-outcome pairs (s, z)
* use SGD to minimize MSE
* to avoid overfitting to the strongly correlated positions within games
* built a new dataset of uncorrelated self play positions
* has over 30 million positions, each drawn from a unique game of self-play
* this dataset provides unbiased samples of the value function
* training method was identical to SL policy network training
* except that the parameter update was based on mean squared error between predicted values and observed rewards
* network was trained for 50 million mini-batches of 30 positions, using 50 GPUs, for one week.
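The regression objective can be sketched as follows. For brevity this uses a linear value function v(s) = θ·φ(s) in place of the deep convolutional network, so only the loss and the SGD-on-MSE step match the notes; everything else is illustrative.

```python
import numpy as np

def mse(v_pred, z):
    """Mean squared error between predicted values and observed game outcomes."""
    return np.mean((v_pred - z) ** 2)

def value_sgd_step(theta, features, z, lr=0.003):
    """One SGD step on the MSE for a linear value sketch v(s) = theta . phi(s).
    features has shape (batch, dim); z holds the outcomes for the batch."""
    v = features @ theta
    grad = 2.0 * features.T @ (v - z) / len(z)
    return theta - lr * grad
```

The key point from the notes is that each position in a batch comes from a distinct self-play game, so the regression targets are uncorrelated.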
Features for policy/value network
* each position s was pre-processed into a set of 19×19 feature planes.
* status of each intersection of Go board
* stone color, liberties, captures, legality, turns since stone was played, current color to play
* each integer feature value is split into multiple 19×19 planes of binary values (one-hot encoding)
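The one-hot split of an integer feature into binary planes can be sketched like this. The capping of large values into the last plane is my assumption about how out-of-range counts (e.g. many liberties) are handled; the function name is illustrative.

```python
import numpy as np

def to_binary_planes(feature, n_planes, board=19):
    """Split a board x board integer feature map into n_planes binary planes.
    Plane k is 1 where the feature equals k; values >= n_planes - 1 are
    capped into the last plane (an assumption for this sketch)."""
    capped = np.minimum(feature, n_planes - 1)
    planes = np.zeros((n_planes, board, board), dtype=np.uint8)
    for k in range(n_planes):
        planes[k] = (capped == k)
    return planes
```

Stacking the planes of all features this way yields the 48-plane input described below.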
* for policy network
* input to the policy network is a 19×19×48 image stack of 48 feature planes
* first hidden layer zero pads the input into a 23×23 image
* then convolves k filters of kernel size 5×5 with stride 1, followed by a rectifier nonlinearity.
* each of hidden layers 2 to 12 zero pads the output of the previous layer into a 21×21 image
* then convolves k filters of kernel size 3×3 with stride 1, followed by a rectifier nonlinearity.
* final layer convolves 1 filter of kernel size 1×1 with stride 1, applies a softmax function
* the match version of AlphaGo used k=192 filters
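A quick shape check confirms that every layer in this architecture preserves the 19×19 board: padding to 23×23 before a 5×5 convolution (pad 2) and to 21×21 before a 3×3 convolution (pad 1) both leave the spatial size unchanged. The helper names below are mine.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output spatial size of a convolution with zero padding."""
    return (size + 2 * pad - kernel) // stride + 1

def policy_net_shapes(board=19):
    """Spatial sizes after each of the 13 layers of the policy network."""
    sizes = [conv_out(board, 5, pad=2)]      # layer 1: pad to 23x23, 5x5 conv
    for _ in range(11):                      # layers 2-12: pad to 21x21, 3x3 conv
        sizes.append(conv_out(sizes[-1], 3, pad=1))
    sizes.append(conv_out(sizes[-1], 1))     # final layer: 1x1 conv, then softmax
    return sizes
```

The softmax over the final 19×19 map gives a probability distribution over board moves.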
* for value network
* input is the same 19×19×48 image stack
* plus an additional binary feature plane describing the current color to play
* hidden layers 2 to 11 are identical to the policy network
* but hidden layer 12 is an additional conv layer
* hidden layer 13 convolves 1 filter of kernel size 1×1 with stride 1
* hidden layer 14 is a fully connected linear layer with 256 rectifier units
* output layer is a fully connected linear layer with a single tanh unit
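The last two layers of the value network can be sketched as below: a fully connected layer of 256 rectifier units followed by a single tanh unit, which bounds the position evaluation in (−1, 1). The weights and names here are illustrative placeholders.

```python
import numpy as np

def value_head(h_in, W1, b1, W2, b2):
    """Final layers of the value network sketch: 256 rectifier units,
    then a single tanh unit producing a scalar position evaluation."""
    h = np.maximum(0.0, h_in @ W1 + b1)   # fully connected, 256 ReLU units
    return np.tanh(h @ W2 + b2)           # single tanh output in (-1, 1)
```

The tanh range matches the outcome targets z ∈ {+1, −1} used in the value regression.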