This article proposes a reinforcement learning (RL) method based on the Actor-Critic architecture, which can be applied to partially-observable multi-agent competitive games. As an example, we consider the card game "Hearts". The learning problem then becomes a partially-observable Markov decision process (POMDP). However, the card distribution can be inferred from the information disclosed as a single game proceeds. In addition, the strategies (models) of the other players can be learned from their actual plays over repeated games. In our method, a single Hearts game is divided into three stages, and three actors are prepared so that each plays and learns separately in its own stage. In particular, the actor for the middle stage selects its plays so as to maximize the expected temporal-difference (TD) error, which is calculated from the evaluation function approximated by the critic and the estimated state transition. After a learning player trained by our RL method plays several thousand training games against three heuristic players, the RL player becomes strong enough to beat the heuristic players.
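As a rough illustration of the middle-stage actor's action selection described above, the sketch below greedily picks the legal play whose expected TD error is largest under an estimated state-transition model. The interfaces `critic_value`, `transition_model`, and `reward_fn`, as well as the purely greedy selection, are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def select_action(state, legal_actions, critic_value, transition_model,
                  reward_fn, gamma=1.0):
    """Choose the action with the largest expected TD error.

    critic_value(s)          -> scalar state value V(s)      (assumed interface)
    transition_model(s, a)   -> iterable of (next_state, prob) pairs
                                from the estimated transition (assumed interface)
    reward_fn(s, a, s_next)  -> immediate reward             (assumed interface)
    """
    v_s = critic_value(state)
    best_action, best_td = None, -np.inf
    for a in legal_actions:
        # Expected TD error under the estimated state transition:
        #   E[delta] = sum_s' P(s'|s,a) * (r(s,a,s') + gamma * V(s')) - V(s)
        expected_td = sum(
            p * (reward_fn(state, a, s_next) + gamma * critic_value(s_next))
            for s_next, p in transition_model(state, a)
        ) - v_s
        if expected_td > best_td:
            best_action, best_td = a, expected_td
    return best_action
```

In practice a stochastic policy (e.g. softmax over the expected TD errors) could replace the argmax to preserve exploration during training; the greedy form is shown only to make the selection criterion explicit.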