Artificial Intelligence

Opponent Modelling for Training Monte Carlo Tree Search-Based Models in the Board Game Citadels

Arthur Camilotto, Felipe Grando, João Luis Almeida Santos, Eduardo V. P. Fiorentin, Djonatan R. C. Bonelli

November 2025
Universidade Federal da Fronteira Sul (UFFS)
XV Jornada de Iniciação Científica e Tecnológica, UFFS, Chapecó, 2025

Abstract

This work investigates how opponent modelling influences the training of a hybrid MCTS-RL framework applied to the board game Citadels, which features multiple players, hidden information, and time constraints. The proposal decouples planning from execution through a persistent decision structure built offline from simulations, reducing the online computational cost of MCTS. In a simulation environment with expert agents and a random agent, models from different training sessions (including one in a varied environment) are compared by win rate. The results indicate that diversity of strategies in training tends to produce a more robust and generalist agent than training against individually strong opponents.

Keywords: Monte Carlo Tree Search; Reinforcement Learning; Opponent Modelling; Agents; Board Games; Citadels

1. Introduction

Monte Carlo Tree Search (MCTS) has become established as an effective technique for decision-making in board games, combining adaptability with little reliance on domain-specific knowledge. Its performance is notable in deterministic games with perfect information, such as Go (SILVER et al., 2016).

However, its application encounters significant limitations in environments with multiple players, hidden information, and time constraints, where the high computational cost of simulations and the difficulty of adaptation compromise the responsiveness and quality of decisions (POWLEY et al., 2014).

In this context, hybrid approaches that integrate MCTS and Reinforcement Learning (RL) have shown promise by allowing the planning phase to be decoupled from execution. The construction of persistent decision structures, with prior storage of the search, enables the instantaneous selection of actions during the game, reducing the need for online simulations and preserving strategic depth (BROWNE, 2012; SWIECHOWSKI, 2023).

In this research, a hybrid MCTS model with offline reinforcement learning techniques was developed and tested. The model was trained and evaluated against different opponents in a simulation environment of the board game Citadels (FAIDUTTI, 2016).

2. General Objective

The general objective of this research is to develop a hybrid MCTS-RL framework capable of operating efficiently in multiplayer board games with hidden information, using Citadels as a case study.

The proposal seeks to eliminate computational bottlenecks of traditional MCTS through the offline construction of a persistent decision structure, allowing for fast and robust actions at runtime.

3. Methodology

This research is an applied experimental study that combines qualitative and quantitative analysis.

The first stage involved the development and adaptation of a digital simulation environment for the board game Citadels. This was followed by the development of five expert agents with customized strategies, built to mimic real strategies used by human players, and the development of a baseline agent that chooses actions randomly. After the validation of the environment and the developed agents, the development of the hybrid MCTS-RL framework began.

The MCTS model is, in its essence, a heuristic search algorithm that operates through the execution of multiple simulations. For each distinct game state visited during these simulations, a corresponding node is created in a tree data structure. The initial game state represents the root node of this tree.
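The tree described above can be illustrated with a minimal sketch. The class name `Node`, the string state keys, and the single win counter are assumptions made for illustration; a multiplayer implementation would normally track results per player, and the original framework may differ.

```python
class Node:
    """One node per distinct game state reached during simulations."""

    def __init__(self, state_key, parent=None):
        self.state_key = state_key  # hypothetical hashable summary of the state
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.wins = 0

    def child(self, action, next_state_key):
        # Create the child the first time this action is tried from this state.
        if action not in self.children:
            self.children[action] = Node(next_state_key, parent=self)
        return self.children[action]

    def backpropagate(self, won):
        # Propagate the simulation result from this node up to the root.
        node = self
        while node is not None:
            node.visits += 1
            node.wins += int(won)
            node = node.parent
```

The root is created with `Node("start")`, and every simulated action either reuses or creates the corresponding child node.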

However, storing the entire tree and preserving it for future decisions is infeasible because of the game's massive branching factor. To make persistence possible, the tree must be abstracted: the game state is decomposed into its fundamental characteristics (e.g., player's gold, number of cards in hand), and each characteristic is mapped to an independent data table.
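As a rough sketch of this abstraction, a game state can be reduced to a small dictionary of characteristics, each of which would index its own table. The `GameState` fields, the function name, and the caps on each value are assumptions for illustration; the text does not specify how values are bounded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameState:
    # Hypothetical subset of a Citadels state; the real state is richer.
    gold: int
    cards_in_hand: int


def extract_features(state: GameState, gold_cap: int = 10, hand_cap: int = 8) -> dict[str, int]:
    """Split a state into independent characteristics, one per table.

    The caps are an assumption that keeps each table's row count bounded.
    """
    return {
        "gold": min(state.gold, gold_cap),
        "cards_in_hand": min(state.cards_in_hand, hand_cap),
    }
```

Two states that differ only in details outside these characteristics map to the same rows, which is what keeps the stored structure small.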

In this structure, the rows of a table represent the possible values of a state characteristic, and the columns represent the possible actions the agent can take (e.g., gain gold, draw cards). Two values are stored in each cell: the number of victories achieved (n) after taking that action in that state, and the total number of times this combination was explored (m). These tables, when populated through simulations of Citadels matches, configure the agent's learned model.
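One possible layout for such a table is sketched below. The action names follow the examples in the text; the helper names `new_table` and `win_rate`, and the choice to report 0.0 for unexplored cells, are assumptions.

```python
from collections import defaultdict

ACTIONS = ("gain_gold", "draw_cards")  # example actions from the text


def new_table():
    """Rows: values of one state characteristic. Columns: actions.

    Each cell holds [n, m]: victories and times the combination was explored.
    """
    return defaultdict(lambda: {action: [0, 0] for action in ACTIONS})


def win_rate(table, value, action):
    n, m = table[value][action]
    # Unexplored cells have no estimate yet; 0.0 is an arbitrary placeholder.
    return n / m if m else 0.0
```

A cell with n = 2 and m = 4 yields an empirical win rate of 0.5 for that value and action.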

The training process includes a reinforcement learning component: after each simulated match, the results update the decision tables, influencing future choices. Victories act as rewards and reinforce effective actions, increasing their probability of being chosen again, while defeats reduce this possibility. Thus, over many simulations, the model adapts continuously, balancing the exploration of new alternatives and the exploitation of proven strategies, gradually converging towards more effective behaviors.
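The update and selection steps could look like the sketch below. The text does not state which selection rule balances exploration and exploitation, so the UCB1-style score, the exploration constant `c`, and the averaging of scores across tables are all assumptions rather than the authors' method.

```python
import math


def update(tables, trajectory, won):
    """After a match, credit every (characteristic value, action) pair visited."""
    for features, action in trajectory:
        for name, value in features.items():
            row = tables.setdefault(name, {}).setdefault(value, {})
            cell = row.setdefault(action, [0, 0])  # [n wins, m explorations]
            cell[1] += 1
            if won:
                cell[0] += 1


def score(tables, features, action, c=1.4):
    """UCB1-style score averaged over tables (an assumed combination rule)."""
    total = 0.0
    for name, value in features.items():
        row = tables.get(name, {}).get(value, {})
        n, m = row.get(action, (0, 0))
        if m == 0:
            return math.inf  # always try unexplored combinations first
        row_visits = sum(cell[1] for cell in row.values())
        total += n / m + c * math.sqrt(math.log(row_visits) / m)
    return total / len(features)


def choose(tables, features, actions):
    return max(actions, key=lambda a: score(tables, features, a))
```

Here `trajectory` is a list of `(features, action)` pairs recorded during one match, where `features` is a dictionary such as `{"gold": 2, "cards_in_hand": 1}`.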

4. Results and Discussion

To evaluate the performance of the proposed MCTS model, it was first necessary to understand the relative strength of the baseline strategies (expert agents and random agent). To this end, each strategy played a round of 10,000 matches against four identical opponents, for each type of opponent.

The heatmap presented in Figure 1 illustrates the performance of each strategy (Y-axis) against the different opponents (X-axis). The strategies possess distinct strengths and weaknesses: Experts 1, 2, and 3 prove competent in most scenarios, while Experts 4 and 5 are considerably less effective. As expected, the completely random strategy shows very low performance against any minimally structured opponent.

Figure 1: Comparison between strategies.

The next step was to evaluate the MCTS-RL model. Different training sessions were conducted, each focusing on one of the opponents, along with a session in a varied environment; each version of the model was trained for 100,000 matches. The performance of each trained model was then measured against all types of opponents using the same test protocol as before, and against randomly drawn varied opponents.
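The difference between a single-opponent session and a varied one can be sketched as follows. The opponent labels, the four-seat table (borrowed from the baseline test setup), and sampling with replacement are assumptions; the text does not describe how varied opponents are drawn.

```python
import random

# Hypothetical labels for the five expert agents and the random baseline.
OPPONENT_POOL = ["expert_1", "expert_2", "expert_3", "expert_4", "expert_5", "random"]


def sample_opponents(mode, rng, seats=4):
    """Return the opponents for one match.

    mode is either one label from OPPONENT_POOL (identical opponents)
    or "varied" (each seat drawn independently at random).
    """
    if mode == "varied":
        return [rng.choice(OPPONENT_POOL) for _ in range(seats)]
    return [mode] * seats
```

With `rng = random.Random(0)`, `sample_opponents("expert_4", rng)` returns four copies of `"expert_4"`, while `sample_opponents("varied", rng)` returns a mixed table.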

The heatmap shown in Figure 2 presents the consolidated win rate of each model. The most significant result was that of the model trained in the varied environment, which achieved consistent performance, outperforming the other versions of the model in almost all test scenarios. This suggests that exposure to a diversity of strategies during training was fundamental for developing a more robust and generalist playing capability.

Figure 2: Comparison between trained models.

The models trained against Experts 4 and 5, considered the weakest opponents, became surprisingly effective agents. Conversely, training against Experts 1 and 3, considered strong opponents, produced models with very poor overall performance.

A possible explanation for this contrast is that advanced expert strategies tend to be highly specialized, whereas the MCTS-RL model, at the beginning of training, approaches random behavior. Against strong opponents, even potentially advantageous actions may not result in enough victories for the model to recognize patterns of success. Against less specialized opponents, by contrast, victories occur more frequently, which favors learning. Still, this remains a hypothesis that requires confirmation.

5. Conclusion

This study demonstrated the viability of the proposed approach, which uses an abstraction of the MCTS tree to preserve the acquired knowledge between moves. The results confirmed that the model is capable of developing robust strategies for the environment.

The superior performance of the agent trained in the varied environment suggests that exposure to multiple tactics is fundamental for building a generalist model. Additionally, the research indicates that the quality of the final agent depends more on the diversity than on the apparent strength of the training opponents.

Future work should investigate more precisely why training against less specialized strategies produced more effective models than training against stronger ones. A comparison of the proposed model with other MCTS models would also give a more accurate assessment of the approach.