Description: |
In deep reinforcement learning, off-policy data help reduce on-policy interaction with the environment, and the trust region policy optimization (TRPO) method is effective at stabilizing the policy optimization procedure. In this article, we propose an off-policy TRPO method, termed off-policy TRPO, which exploits both on- and off-policy data and guarantees the monotonic improvement of policies. To this end, we develop a surrogate objective function that incorporates both on- and off-policy data while preserving the monotonic improvement guarantee. We then optimize this surrogate objective by approximately solving a constrained optimization problem under arbitrary parameterization with finite samples. We conduct experiments on representative continuous control tasks from OpenAI Gym and MuJoCo. The results show that the proposed off-policy TRPO outperforms other trust region policy-based methods that use off-policy data on the majority of continuous control tasks.