Using Posterior Variance Estimates to Improve Exploration in Monte Carlo Tree Search

Reformulated the MCTS value estimate as a Gaussian posterior over each node’s true value, propagating both mean and variance up the tree during backup to capture epistemic uncertainty in unexplored subtrees. Replaced the standard UCB1 exploration bonus with a posterior-variance-based bonus, using Thompson Sampling to select the next action branch to visit rather than relying on visit-count heuristics alone. Demonstrated improved sample efficiency and stronger final policies than vanilla UCT on benchmark planning tasks at matched simulation budgets.
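As a rough illustration of the idea, the sketch below keeps a Gaussian posterior (mean and variance) at each node, updates it during backup, and picks a child by Thompson Sampling. It is a minimal sketch under stated assumptions, not the project's actual implementation: the names `GaussianNode`, `backup`, and `thompson_select` are hypothetical, and the conjugate Normal-Normal update with a known rollout-noise variance `obs_var` is an assumed simplification of how mean and variance are propagated.

```python
import math
import random


class GaussianNode:
    """Hypothetical tree node holding a Gaussian posterior over its true value."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, obs_var=1.0):
        self.mean = prior_mean   # posterior mean of the node's value
        self.var = prior_var     # posterior variance (epistemic uncertainty)
        self.obs_var = obs_var   # assumed known noise variance of a rollout return

    def update(self, ret):
        """Conjugate Normal-Normal update for one observed return."""
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mean = (self.mean / self.var + ret / self.obs_var) / precision
        self.var = 1.0 / precision


def backup(path, ret):
    """Assumed backup rule: update every node on the visited path, leaf to root."""
    for node in reversed(path):
        node.update(ret)


def thompson_select(children, rng=random):
    """Draw one sample from each child's posterior; return the index of the largest."""
    samples = [rng.gauss(c.mean, math.sqrt(c.var)) for c in children]
    return max(range(len(children)), key=samples.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    root = GaussianNode()
    children = [GaussianNode(), GaussianNode()]
    # Toy environment (made up for this demo): child 1 has the higher mean return.
    true_means = [0.0, 1.0]
    for _ in range(200):
        i = thompson_select(children, rng)
        ret = rng.gauss(true_means[i], 1.0)
        backup([root, children[i]], ret)
    for i, c in enumerate(children):
        print(f"child {i}: mean={c.mean:.2f}, var={c.var:.4f}")
```

Because a rarely visited child keeps a wide posterior, its samples occasionally exceed those of a well-explored sibling, so exploration comes from the variance itself rather than from a visit-count term as in UCB1.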

Dixant Mittal

My research interests include reinforcement learning, planning & search, large language models, and decision-making under uncertainty.