MCTS in AlphaGo
* MCTS selects actions by lookahead search
* see Fig 3 of paper
* each edge (s, a) of the search tree stores an action value Q(s,a), visit count N(s,a), and prior probability P(s,a).
* the tree is traversed by simulation, starting from the root state
* here, simulation means descending the tree in complete games, without backup
* at each time step t of each simulation, an action a_t is selected from state s_t:
* a_t = argmax_a ( Q(s_t, a) + u(s_t, a) )
* so as to maximize action value plus a bonus u(s,a) ∝ P(s,a) / (1 + N(s,a))
* this u(s,a) is proportional to the prior probability but decays with repeated visits, which encourages exploration
* when the traversal reaches a leaf node s_L at step L, the leaf node may be expanded
* the leaf position s_L is processed just once by the SL policy network p1
* the output probabilities are stored as prior probabilities P for each legal action a: P(s,a) = p1(a|s)
* the leaf node is evaluated in two very different ways
* first, by the value network v1(s_L)
* second, by the outcome z_L of a random rollout played out until terminal step T using the fast rollout policy p2
* those two are combined to get final leaf evaluation
* V(s_L) = (1 - lambda) * v1(s_L) + lambda * z_L
* at the end of the simulation, the action values and visit counts of all traversed edges are updated
* each edge accumulates the visit count and mean evaluation of all simulations passing through that edge
* once the search is complete, the algorithm chooses the most visited move from the root position
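The selection, leaf evaluation, and backup steps above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the constants C_PUCT and LAMBDA are illustrative values, and the bonus uses the simplified form u ∝ P / (1 + N) from these notes rather than the paper's exact PUCT variant.

```python
C_PUCT = 5.0   # exploration constant (assumed value, for illustration)
LAMBDA = 0.5   # mixing weight between value network and rollout outcome

class Edge:
    """Per-edge statistics (s, a): prior P, visit count N, total value W."""
    def __init__(self, prior):
        self.P = prior   # prior probability from the SL policy network
        self.N = 0       # visit count
        self.W = 0.0     # accumulated evaluations

    @property
    def Q(self):         # mean evaluation of simulations through this edge
        return self.W / self.N if self.N else 0.0

def select(edges):
    """a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ], with u proportional to P / (1 + N)."""
    return max(edges, key=lambda a: edges[a].Q + C_PUCT * edges[a].P / (1 + edges[a].N))

def evaluate_leaf(v_leaf, z_rollout):
    """V(s_L) = (1 - lambda) * v1(s_L) + lambda * z_L."""
    return (1 - LAMBDA) * v_leaf + LAMBDA * z_rollout

def backup(path, value):
    """Update visit counts and mean evaluations of all traversed edges."""
    for edge in path:
        edge.N += 1
        edge.W += value
```

With zero visits everywhere, selection is driven entirely by the priors; as N grows, the bonus shrinks and Q dominates, matching the "initially prefers high prior, asymptotically prefers high action value" behavior described below.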
Details of MCTS in AlphaGo
* use APV-MCTS (asynchronous policy and value MCTS)
* multiple simulations are executed in parallel on separate search threads
* the first in-tree phase of each simulation begins at the root of tree, and finishes when simulation reaches a leaf node at time step L.
* at each of those time steps, t<L, an action is selected according to the statistics in search tree
* a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ), using a variant of the PUCT algorithm
* this search control strategy initially prefers actions with high prior probability and low visit count
* asymptotically prefers actions with high action value
* the leaf position s_L is added to a queue for evaluation v(s_L) by the value network, unless it has previously been evaluated
* the second rollout phase of each simulation begins at leaf node s_L and continues until end of game
* at each of these time-steps, t>=L, actions are selected by both players according to the rollout policy
* when game reaches a terminal state, the outcome z_t is computed from the final score.
* all updates are performed lock-free
* when the visit count exceeds an expansion threshold of 40, the successor state s’ is added to the search tree
* the new node is initialized with default statistics (e.g., 0) and placeholder prior probabilities for action selection
* the position s’ is also inserted into a queue for asynchronous GPU evaluation by the policy network
* prior probabilities are computed by the SL policy network, with a softmax temperature parameter
* the threshold is adjusted dynamically to ensure that the rate at which positions are added to the policy queue matches the rate at which the GPUs evaluate the policy network.
* positions are evaluated by both the policy network and value network using a mini-batch size of 1 to minimize end-to-end evaluation time.
* there is a mechanism called virtual loss
* at each in-tree step t<=L of the simulation, the rollout statistics are updated as if the simulation had lost n_vl games; this discourages other threads from descending the same path
* at the end of the simulation, the rollout statistics are updated in a backward pass through each step t<=L, replacing the virtual losses with the outcome
* asynchronously, a separate backward pass is initiated when the evaluation of the leaf position s_L completes
* the SL policy network p1 performed better in AlphaGo than the stronger RL policy network p3, presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move
* however, the value function derived from the stronger RL policy network p3 performed better in alphago than a value function derived from the SL policy network.
* evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics
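The virtual-loss bookkeeping can be illustrated with a small sketch. The Edge layout and function names are assumptions, and the value of n_vl here is illustrative:

```python
N_VL = 3  # virtual losses per in-flight simulation (illustrative value)

class Edge:
    def __init__(self):
        self.N = 0    # visit count
        self.W = 0.0  # accumulated outcomes

def apply_virtual_loss(path):
    # While a simulation is in flight, update statistics as if it had
    # lost N_VL games, discouraging other threads from the same path.
    for edge in path:
        edge.N += N_VL
        edge.W -= N_VL

def replace_with_outcome(path, outcome):
    # Backward pass at the end of the simulation: remove the virtual
    # losses and record the real visit and outcome.
    for edge in path:
        edge.N += 1 - N_VL
        edge.W += outcome + N_VL
```

After the backward pass, the edge statistics are exactly as if the simulation had simply been counted once with its true outcome, so the virtual loss only affects concurrent action selection, not the final values.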
How to efficiently combine MCTS with DNNs in AlphaGo?
* use an asynchronous multi-threaded search that executes simulations on CPUs
* computes policy and value networks in parallel on GPUs
* final version of alphago used
* 40 search threads
* 48 CPUs
* 8 GPUs
* implemented a distributed version of alphago that
* exploited multiple machines
* 40 search threads
* 1202 CPUs and 176 GPUs
* consists of a single master machine that executes the main search
* many remote worker CPUs that execute asynchronous rollouts
* many remote worker GPUs that execute asynchronous policy and value network evaluations
* entire search tree is stored on the master, which only executes the in-tree phase of each simulation.
* the leaf positions are communicated to the worker CPUs, which execute the rollout phase of simulation, and to the worker GPUs, which compute network features and evaluate the policy and value networks.
* the prior probabilities of the policy network are returned to the master, where they replace placeholder prior probabilities at the newly expanded node
* the rewards from rollouts and the value network outputs are each returned to the master, and backed up through the originating search path
How does AlphaGo select actions?
* at the end of tree search, alphago selects the action with maximum visit count
* this is less sensitive to outliers than maximizing action value.
* the search tree is reused at subsequent time steps:
* the child node corresponding to the played action becomes the new root node
* the subtree below this child is retained along with all its statistics
* the remainder of the tree is discarded
* the match version of AlphaGo continues searching during the opponent’s move
* it extends the search if the action maximizing visit count and the action maximizing action value disagree.
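The root action choice and subtree reuse can be sketched as follows; the Node layout here is an assumption for illustration:

```python
class Node:
    def __init__(self):
        self.children = {}  # action -> child Node (statistics live in the subtree)
        self.visits = {}    # action -> visit count N(root, action)

def select_move(root):
    # Pick the most-visited action: less sensitive to outliers than
    # maximizing the action value directly.
    return max(root.visits, key=root.visits.get)

def advance_root(root, played_action):
    # The child of the played action becomes the new root, keeping its
    # subtree and statistics; the rest of the tree is discarded.
    return root.children.get(played_action, Node())
```

Visit counts are a robust choice because a single lucky simulation can inflate Q on a rarely-visited edge, while a large visit count only accumulates on moves the search repeatedly confirmed.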
When will AlphaGo resign?
* it resigns when its overall evaluation drops below an estimated 10% probability of winning the game.
What doesn’t AlphaGo use?
* it does not employ the all-moves-as-first or rapid action value estimation heuristics used in the majority of Monte Carlo Go programs
* does not use progressive widening, dynamic komi, or an opening book.
How large is the search tree?
Evaluating the playing strength of AlphaGo
* run an internal tournament among variants of alphago and several other Go programs
* including the strongest commercial programs Crazy Stone and Zen
* including the strongest open source programs Pachi and Fuego
* included the open source program GnuGo, which uses state-of-the-art search methods that preceded MCTS
* all programs were allowed 5s of computation time per move.
Results of the tournament show:
* single machine alphago is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%)
* the distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs
AlphaGo vs Deep Blue
* during the match against Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov
* Deep Blue relied on a handcrafted evaluation function, whereas the neural networks of AlphaGo are trained directly from gameplay, purely through general-purpose supervised and reinforcement learning methods
Go is challenging:
* a challenging decision-making task
* an intractable search space
* an optimal solution so complex it appears infeasible to directly approximate using a policy or value function.
* AlphaGo’s use of value functions is based on truncated Monte Carlo search algorithms, which terminate rollouts before the end of the game and use a value function in place of the terminal reward
* alphago’s position evaluation mixes full rollouts with truncated rollouts, resembling in some respects the well-known temporal-difference learning algorithm TD(lambda)
* the performance of MCTS is to a large degree determined by the quality of the rollout policy
* it is known that rollout-based position evaluation is frequently inaccurate.
* AlphaGo uses relatively simple rollouts, and instead addresses the challenging problem of position evaluation more directly using value networks
* similar to the policy network, the weights of the rollout policy are trained from 8 million positions from human games on the Tygem server to maximize log likelihood by SGD
* rollouts execute at around 1000 simulations per second per CPU thread on an empty board.
* a small number of handcrafted local features encode common-sense Go rules.
* the rollout policy contains less handcrafted knowledge than state-of-the-art Go programs.
* exploit the higher-quality action selection within MCTS
* caches all moves from the search tree and then plays similar moves during rollouts
* at every step of tree traversal, the most probable action is inserted into a hash table, along with the 3×3 pattern context (color, liberty, stone counts) around both previous move and current move.
* at each step of rollout, the pattern context is matched against the hash table.
* if a match is found then the stored move is played with high probability.
* exploit symmetries at run-time by dynamically transforming each position s using the dihedral group of eight reflections and rotations.
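The eight-fold symmetry can be sketched with plain list-of-lists boards; `rotate90` and `reflect` are hypothetical helper names, not from the paper:

```python
def rotate90(board):
    # Rotate a square board 90 degrees clockwise.
    return [list(row) for row in zip(*board[::-1])]

def reflect(board):
    # Reflect a board left-to-right.
    return [row[::-1] for row in board]

def dihedral_transforms(board):
    """Return the eight rotations/reflections (dihedral group) of a square board."""
    b = [list(row) for row in board]
    out = []
    for _ in range(4):
        out.append(b)
        out.append(reflect(b))
        b = rotate90(b)
    return out
```

Each rotation plus its reflection yields the full group of eight: a position can be evaluated under any of these transforms and the result mapped back, averaging out orientation-dependent noise in the networks.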