Author: |
Mike Li, Quang Dang Nguyen |
Language: |
English |
Year of publication: |
2021 |
Subject: |
|
Source: |
IEEE Access, Vol 9, Pp 96641-96657 (2021) |
Document type: |
article |
ISSN: |
2169-3536 |
DOI: |
10.1109/ACCESS.2021.3094623 |
Description: |
Learning action policies for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose modeling this problem as a contextual bandit with delayed reward signals, specifically an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to model these delayed reward signals directly and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration implemented with deep learning regressors shows that, compared to a baseline policy, the proposed algorithm is effective at learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team. |
Database: |
Directory of Open Access Journals |
External link: |
|
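
The description above outlines a contextual bandit with two delayed reward signals, reward oracles, and sampling guided by an expert-designed policy. The following is a minimal sketch of that general idea, not the authors' implementation. It makes several assumptions: the paper's deep learning regressors are swapped for simple per-action ridge regressors, the two predictions are combined by a weighted sum, and the environment is synthetic. All names (`RewardOracle`, `choose_action`, `train_toy`, `expert_prob`, `long_weight`) are hypothetical and introduced only for illustration.

```python
import numpy as np


class RewardOracle:
    """Hypothetical per-action ridge regressor: context vector -> predicted reward."""

    def __init__(self, n_actions, dim, ridge=1.0):
        # One (dim x dim) Gram matrix and one (dim,) moment vector per action.
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_actions)])
        self.b = np.zeros((n_actions, dim))

    def update(self, context, action, reward):
        """Fold one (context, action, observed reward) sample into the fit."""
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context

    def predict(self, context):
        """Return the predicted reward of every action for this context."""
        weights = np.linalg.solve(self.A, self.b[..., None])[..., 0]
        return weights @ context


def choose_action(context, short_oracle, long_oracle, expert_policy, rng,
                  expert_prob=0.5, long_weight=1.0):
    """Sample from the expert with probability expert_prob, else act greedily
    on the weighted sum of the two oracle predictions (an assumed combination)."""
    if rng.random() < expert_prob:
        return int(expert_policy(context))
    scores = short_oracle.predict(context) + long_weight * long_oracle.predict(context)
    return int(np.argmax(scores))


def train_toy(n_episodes=200, steps=5, n_actions=3, dim=4, seed=0):
    """Toy loop: rewards are only revealed after the agent has acted."""
    rng = np.random.default_rng(seed)
    true_short = rng.normal(size=(n_actions, dim))  # synthetic ground truth
    true_long = rng.normal(size=(n_actions, dim))   # synthetic ground truth
    short_oracle = RewardOracle(n_actions, dim)
    long_oracle = RewardOracle(n_actions, dim)

    def expert_policy(ctx):
        # Stand-in for a hand-designed policy: greedy on short-term reward only.
        return np.argmax(true_short @ ctx)

    for _ in range(n_episodes):
        episode = []
        for _ in range(steps):
            ctx = rng.normal(size=dim)
            action = choose_action(ctx, short_oracle, long_oracle, expert_policy, rng)
            episode.append((ctx, action))
        # Delayed feedback: both signals arrive after the episode's actions.
        for ctx, action in episode:
            short_oracle.update(ctx, action, true_short[action] @ ctx + 0.1 * rng.normal())
            long_oracle.update(ctx, action, true_long[action] @ ctx + 0.1 * rng.normal())
    return short_oracle, long_oracle
```

Calling `train_toy()` returns the two fitted oracles. Setting `expert_prob=0.0` in `choose_action` then yields the purely learned policy, while `expert_prob=1.0` reproduces the expert. In this toy the long-term signal is a per-step synthetic value revealed at the end of the episode; it does not model reward sharing between multiple agents.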