Title
Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game
Abstract
Deep policy gradient methods have demonstrated promising results in many large-scale games, where the agent learns purely from its own experience. Yet, policy gradient methods with self-play struggle to converge to a Nash Equilibrium (NE) in multi-agent settings. Counterfactual regret minimization (CFR) is guaranteed to converge to a NE in 2-player zero-sum games, but it usually requires domain-specific abstractions to handle large-scale games. Inheriting merits from both approaches, in this paper we extend the actor-critic framework in deep reinforcement learning to tackle a large-scale 2-player zero-sum imperfect-information game, 1-on-1 Mahjong, whose information-set size and game length are much larger than those of poker. The proposed algorithm, named Actor-Critic Hedge (ACH), replaces the usual policy optimization objective of maximizing discounted returns with minimizing a type of weighted cumulative counterfactual regret. This is achieved by approximating the regret with a deep neural network and minimizing it by generating self-play policies using Hedge. ACH is theoretically justified, as it is derived from a neural-based weighted CFR for which we prove convergence to a NE under certain conditions. Experimental results on the proposed 1-on-1 Mahjong benchmark and on benchmarks from the literature demonstrate that ACH outperforms related state-of-the-art methods. Moreover, the agent trained by ACH defeats a human champion in 1-on-1 Mahjong.
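For intuition, the Hedge rule mentioned in the abstract turns cumulative regret estimates at an information set into a policy via an exponential weighting (a softmax). The sketch below is illustrative only and is not the paper's implementation; the function name hedge_policy, the learning-rate parameter eta, and the assumption that a network supplies per-action regret estimates are hypothetical.

    import numpy as np

    def hedge_policy(regrets, eta=1.0):
        # Hedge: map (estimated) cumulative counterfactual regrets for the
        # actions at an information set to action probabilities via a
        # softmax with learning rate eta. `regrets` stands in for the
        # per-action output of a regret-approximating network.
        logits = eta * np.asarray(regrets, dtype=float)
        logits -= logits.max()          # subtract max for numerical stability
        weights = np.exp(logits)
        return weights / weights.sum()

    # Example: three actions with estimated regrets; the resulting
    # probabilities sum to 1 and favor the highest-regret action.
    policy = hedge_policy([0.2, -0.5, 1.3], eta=2.0)
    print(policy)

In this view, the actor is trained so that its policy tracks the Hedge distribution induced by the regret estimates, rather than directly maximizing discounted returns.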
Year
2022
Venue
International Conference on Learning Representations (ICLR)
Keywords
Policy Optimization, Nash Equilibrium, Mahjong AI
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
11
Name             Order  Citations  PageRank
Haobo Fu         1      5          1.11
Weiming Liu      2      0          1.01
Shuang Wu        3      0          0.68
Yijia Wang       4      0          0.34
Tao Yang         5      5          8.53
Kai Li           6      0          0.34
Junliang Xing    7      1193       63.31
Bin Li           8      782        79.80
Bo Ma            9      0          0.34
Qiang Fu         10     1          4.42
Wei Yang         11     93         27.50