Monte Carlo Tree Search (MCTS) Tutorial – Complete Guide with Examples

Master Monte Carlo Tree Search (MCTS) with this beginner-friendly guide. Learn the theory, step-by-step algorithm, and practical examples used in AI and game-playing systems.

What is MCTS?

Monte Carlo Tree Search (MCTS) is a powerful decision-making algorithm used in games and simulations. It balances exploration (trying new moves) and exploitation (choosing moves known to be good).


MCTS Steps Diagram

[Figure: the four MCTS steps – selection, expansion, rollout, backpropagation]

1. Tree Traversal (Selection)

Select a node to explore by traversing the tree from the root using the UCB1 formula:

$$\text{UCB1}(S_i) = \frac{t_i}{n_i} + C \cdot \sqrt{\frac{\ln N}{n_i}}$$

Where:

  • $t_i$ = total value of node $S_i$
  • $n_i$ = number of visits to node $S_i$
  • $N$ = number of simulations run from the parent node
  • $C$ = exploration constant
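
In code, this formula is essentially one line. Here is a minimal Python sketch (the function name and the convention of returning infinity for unvisited nodes are our own choices, not part of any library):

    import math

    def ucb1_score(total_value, visits, parent_visits, c=2.0):
        """UCB1 = mean value (exploitation) + exploration bonus."""
        if visits == 0:
            return float("inf")  # convention: unvisited nodes are picked first
        return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)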

2. Node Expansion

Generate one or more child nodes from the selected node.

3. Rollout (Random Simulation)

Play out the game from the new node using random moves until a terminal state is reached.

4. Backpropagation

Update the values of the nodes on the path taken during selection based on the result of the simulation.

🔁 This process is repeated until the algorithm converges on a solution and learns the optimal policy.
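
To make the four steps concrete, here is a compact Python sketch of the whole loop. The Node class and the four game hooks (legal_moves, apply_move, is_terminal, terminal_value) are hypothetical placeholders for your game's interface, not a real library:

    import math
    import random

    # Hypothetical game hooks (supply these for your own game):
    #   legal_moves(state)      -> list of legal moves
    #   apply_move(state, move) -> resulting state
    #   is_terminal(state)      -> True when the game has ended
    #   terminal_value(state)   -> numeric reward of a finished game

    class Node:
        def __init__(self, state, parent=None):
            self.state = state
            self.parent = parent
            self.children = []                       # expanded children
            self.untried = list(legal_moves(state))  # actions not yet expanded
            self.t = 0.0                             # total value (t_i)
            self.n = 0                               # visit count (n_i)

    def ucb1(node, c=2.0):
        if node.n == 0:
            return float("inf")  # unvisited children are always tried first
        return node.t / node.n + c * math.sqrt(math.log(node.parent.n) / node.n)

    def mcts(root_state, iterations=1000):
        root = Node(root_state)
        for _ in range(iterations):
            node = root
            # 1. Selection: descend while fully expanded and non-terminal
            while not node.untried and node.children:
                node = max(node.children, key=ucb1)
            # 2. Expansion: add one new child if untried actions remain
            if node.untried:
                move = node.untried.pop()
                child = Node(apply_move(node.state, move), parent=node)
                node.children.append(child)
                node = child
            # 3. Rollout: random playout to a terminal state
            state = node.state
            while not is_terminal(state):
                state = apply_move(state, random.choice(legal_moves(state)))
            value = terminal_value(state)
            # 4. Backpropagation: update t and n along the path to the root
            while node is not None:
                node.n += 1
                node.t += value
                node = node.parent
        # One common final choice: the child with the highest mean value
        return max(root.children, key=lambda ch: ch.t / ch.n)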

Example

[Figure: snapshot of an MCTS tree in progress]

You are given a snapshot of a Monte Carlo Tree Search (MCTS) in progress. Starting from this state, perform two additional rollouts with the terminal values 15 and 75. Then, determine the updated best strategy based on these rollouts. If a random choice is needed at any point, always prefer the right node.

Start from the root node and recursively select child nodes using the UCB1 formula:

$$\text{UCB1}(S_i) = \frac{t_i}{n_i} + C \cdot \sqrt{\frac{\ln N}{n_i}}$$

Rollout 1 (value = 15)

🔹 Step 1: Selection from S0

We compute UCB1 for both child nodes of S0.

Let:

  • $N = n_0 = 3$
  • $\ln N = \ln 3 \approx 1.0986$
  • $C = 2$

→ For S1:

  • $t_1 = 60$, $n_1 = 2$

$$\text{UCB1}_{S_1} = \frac{60}{2} + 2 \cdot \sqrt{\frac{1.0986}{2}} \approx 30 + 2 \cdot 0.741 \approx 30 + 1.482 = 31.482$$

→ For S2:

  • $t_2 = 18$, $n_2 = 1$

$$\text{UCB1}_{S_2} = \frac{18}{1} + 2 \cdot \sqrt{\frac{1.0986}{1}} \approx 18 + 2 \cdot 1.048 \approx 18 + 2.096 = 20.096$$

Select S1 (higher UCB1)
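
As a quick sanity check, both values can be reproduced in a few lines of Python:

    import math

    ln3 = math.log(3)                          # N = n0 = 3
    ucb_s1 = 60 / 2 + 2 * math.sqrt(ln3 / 2)   # ≈ 31.482
    ucb_s2 = 18 / 1 + 2 * math.sqrt(ln3 / 1)   # ≈ 20.096
    print(round(ucb_s1, 3), round(ucb_s2, 3))  # 31.482 20.096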

[Figure: rollout 1 – selection of S1 from S0]

🔹 Step 2: Expansion

Since S1 has unvisited children, we expand it by adding a new child node S3.

[Figure: rollout 1 – expansion adds S3 under S1]

🔹 Step 3: Simulation (Rollout)

From S3, we simulate a random move leading to a terminal state with value 15.

[Figure: rollout 1 – simulation from S3 to a terminal state with value 15]

🔹 Step 4: Backpropagation

Backpropagate the value 15 from S3 to S1 and S0.

  • Update S3, S1, S0:

    S3: t3 = 15, n3 = 1
    S1: t1 = 60 + 15 = 75, n1 = 2 + 1 = 3
    S0: t0 = 83 + 15 = 98, n0 = 3 + 1 = 4
    
    
[Figure: rollout 1 – backpropagation of 15 through S3, S1, S0]
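
The update rule is just a loop over parent pointers. A minimal sketch, assuming each node stores t, n, and a parent reference as in the earlier code:

    def backpropagate(node, value):
        """Add the rollout's terminal value to every node on the path to the root."""
        while node is not None:
            node.n += 1       # one more visit
            node.t += value   # accumulate the terminal value
            node = node.parent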

Rollout 2 (value = 75)

🔹 Step 1: Selection from S0

We compute UCB1 again for S1 and S2, now with $N = n_0 = 4$, so $\ln N = \ln 4 \approx 1.386$.

→ For S1:

  • $t_1 = 75$, $n_1 = 3$

$$\text{UCB1}_{S_1} = \frac{75}{3} + 2 \cdot \sqrt{\frac{1.386}{3}} \approx 25 + 2 \cdot 0.68 = 25 + 1.36 = 26.36$$

→ For S2:

  • $t_2 = 18$, $n_2 = 1$

$$\text{UCB1}_{S_2} = \frac{18}{1} + 2 \cdot \sqrt{\frac{1.386}{1}} \approx 18 + 2 \cdot 1.177 = 18 + 2.354 = 20.354$$

Select S1 again (higher UCB1)

[Figure: rollout 2 – selection of S1 from S0]

🔹 Step 2: Selection from S1

Both of S1's children have now been visited once:

  • S3: $t = 15$, $n = 1$
  • S4: $t = 40$, $n = 1$

Let’s compute UCB1 for both, with $N = n_1 = 3$ (the parent's visit count), so $\ln 3 \approx 1.0986$:

→ For S3:

$$\text{UCB1}_{S_3} = \frac{15}{1} + 2 \cdot \sqrt{\frac{1.0986}{1}} \approx 15 + 2.096 = 17.096$$

→ For S4:

$$\text{UCB1}_{S_4} = \frac{40}{1} + 2.096 = 42.096$$

Select S4
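
Both selection steps of this rollout can be checked the same way (note that $N$ is $\ln 4$ at the root but $\ln 3$ at S1):

    import math

    # Step 1: choosing between S1 and S2 at the root (N = n0 = 4)
    ucb_s1 = 75 / 3 + 2 * math.sqrt(math.log(4) / 3)  # ≈ 26.36
    ucb_s2 = 18 / 1 + 2 * math.sqrt(math.log(4) / 1)  # ≈ 20.35

    # Step 2: choosing between S3 and S4 under S1 (N = n1 = 3)
    ucb_s3 = 15 / 1 + 2 * math.sqrt(math.log(3) / 1)  # ≈ 17.096
    ucb_s4 = 40 / 1 + 2 * math.sqrt(math.log(3) / 1)  # ≈ 42.096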

[Figure: rollout 2 – selection of S4 from S1]

🔹 Step 3: Expansion

Since S4 has unvisited children, we expand it by adding a new child node S6.

[Figure: rollout 2 – expansion adds S6 under S4]

🔹 Step 4: Simulation (Rollout)

From S6, we simulate a random move leading to a terminal state with value 75.

[Figure: rollout 2 – simulation from S6 to a terminal state with value 75]

🔹 Step 5: Backpropagation

Backpropagate the value 75 from S6 to S4, S1, and S0.

  • Update S6, S4, S1, S0:

    S6: t6 = 75, n6 = 1
    S4: t4 = 40 + 75 = 115, n4 = 1 + 1 = 2
    S1: t1 = 75 + 75 = 150, n1 = 3 + 1 = 4
    S0: t0 = 98 + 75 = 173, n0 = 4 + 1 = 5
    
    
[Figure: rollout 2 – backpropagation of 75 through S6, S4, S1, S0]
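
Updated Best Strategy

With both rollouts backpropagated, the best strategy follows from the mean values $t/n$ (one common criterion): at the root, S1 averages 150/4 = 37.5 against 18/1 = 18 for S2; under S1, S4 averages 115/2 = 57.5 against 15/1 = 15 for S3. The updated best strategy is therefore S0 → S1 → S4. A quick check in Python:

    # Mean value t / n for each node after both rollouts
    means = {name: t / n for name, (t, n) in
             {"S1": (150, 4), "S2": (18, 1), "S3": (15, 1), "S4": (115, 2)}.items()}
    print(means)  # {'S1': 37.5, 'S2': 18.0, 'S3': 15.0, 'S4': 57.5}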

Q&A

1. Which Algorithm Does the Rollout Step Follow?

The Rollout step in Monte Carlo Tree Search (MCTS) typically follows a default policy, which is often:

A random simulation policy (i.e., play random legal moves until the game ends).

This is not a complex algorithm — it’s usually a simple random playout to evaluate how promising a move is without deeper computation.

However, in advanced cases (like AlphaGo), the rollout can use:

  • Heuristic-based simulation
  • Or even a learned policy network

But for basic MCTS:

🔁 Rollout = random simulation from the current node to a terminal state
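
A random default policy is only a few lines. Here is a sketch using the same hypothetical game hooks (is_terminal, legal_moves, apply_move, terminal_value) as earlier:

    import random

    def rollout(state):
        """Default policy: play uniformly random legal moves until the game ends."""
        while not is_terminal(state):
            state = apply_move(state, random.choice(legal_moves(state)))
        return terminal_value(state)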

2. What Is a Terminal Node?

A terminal node is a state in the game where no further actions are possible — meaning the game has ended.

It could be:

  • A win
  • A loss
  • Or a draw

Example: Tic-Tac-Toe

Let’s say you’re building an MCTS for Tic-Tac-Toe.

Imagine this board state:

X | O | X
---------
X | O | O
---------
O | X | X

In this case:

  • All 9 cells are filled
  • No more moves can be made
  • It’s a draw

This board is a terminal node in the MCTS tree.

If the rollout reaches this node, the reward is evaluated (say, 0 for draw), and that value is then backpropagated up the tree.
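
As a sketch, a terminal-state test for Tic-Tac-Toe might look like this (the board representation, a list of 9 cells holding 'X', 'O', or None, is our own assumption):

    WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
                 (0, 4, 8), (2, 4, 6)]              # diagonals

    def is_terminal(board):
        """True if a player has three in a row or the board is full (draw)."""
        for a, b, c in WIN_LINES:
            if board[a] is not None and board[a] == board[b] == board[c]:
                return True  # someone has won
        return all(cell is not None for cell in board)  # full board = draw

The drawn board above has no three in a row, so it is terminal via the final all(...) check, and a reward of 0 would be backpropagated.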

Summary

Term          | Meaning
Rollout       | Simulate a game to the end using random or heuristic moves
Terminal Node | A game state where no moves are left (win/loss/draw)
Example       | Full Tic-Tac-Toe board with no empty spots → terminal draw state

3. What if no unvisited nodes exist?

If all possible actions from a node have already been visited, then:

  • Expansion is skipped
  • Selection continues deeper through the already-expanded children (using UCB1), and the simulation (rollout) starts from the node it reaches

In short:

No unvisited nodes ⇒ Skip expansion ⇒ Continue with simulation and backpropagation (see the code sketch after the next question)

4. What if the node is a terminal state?

If during selection:

  • All children have been visited, and
  • The current node is a terminal state (i.e., the game has ended)

Then:

  • Skip expansion
  • Skip simulation

You directly:

  • Backpropagate the terminal reward (win/loss/draw)
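
In code, both this case and the previous one fall out of the same iteration structure. A minimal sketch, reusing the Node fields and the ucb1, rollout, and backpropagate helpers from the earlier sketches:

    def run_iteration(root):
        node = root
        # Selection: while there is nothing left to expand, descend via UCB1
        while not node.untried and node.children:
            node = max(node.children, key=ucb1)
        if node.untried:
            # Normal case: expand one action, then simulate from the new child
            move = node.untried.pop()
            child = Node(apply_move(node.state, move), parent=node)
            node.children.append(child)
            node = child
            value = rollout(node.state)
        else:
            # Terminal leaf: no untried actions and no children.
            # Skip expansion AND simulation; use the terminal reward directly.
            value = terminal_value(node.state)
        backpropagate(node, value)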

Final Note

These four steps are repeated X times (where X is the number of iterations or simulations you want to run). Over time, the tree prioritizes better actions using a balance of exploration and exploitation, learning the optimal game policy.