Showing 1 - 8 of 8 for search: '"Gan, Yaozhong"'
The imbalance of exploration and exploitation has long been a significant challenge in reinforcement learning. In policy optimization, excessive reliance on exploration reduces learning efficiency, while over-dependence on exploitation might trap agents…
External link:
http://arxiv.org/abs/2408.09974
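Background for the trade-off this entry describes (a generic illustration, not taken from the paper; the temperature β is an assumed symbol): a common way to keep exploration alive in policy optimization is to add an entropy bonus to the expected-return objective,

    J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big] + \beta\,\mathbb{E}_{s \sim \pi_\theta}\big[\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)\big].

Larger β favors exploration (higher-entropy policies) and smaller β favors exploitation; the abstract's point is that neither extreme learns efficiently.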
Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies is constrained…
External link:
http://arxiv.org/abs/2406.03894
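For reference, the clipped surrogate objective from the original PPO paper (standard background, not this entry's contribution):

    L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.

The clip keeps the importance ratio r_t near 1; that correction is only reliable when the samples come from π_{θ_old} itself, which is why PPO struggles to exploit data from disparate policies, as the abstract notes.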
On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization…
External link:
http://arxiv.org/abs/2406.03678
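For context, the trust-region problem TRPO solves at each update (standard background; the snippet does not show how Reflective Policy Optimization modifies it):

    \max_{\theta}\ \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\Big[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\Big] \quad \text{subject to} \quad \mathbb{E}_{s}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\big] \le \delta.

Because both the objective and the constraint are estimated from trajectories drawn under π_{θ_old}, every update needs a fresh batch of rollouts, which is the sample inefficiency the abstract describes.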
Advantage Learning (AL) seeks to increase the action gap between the optimal action and its competitors, so as to improve the robustness to estimation errors. However, the method becomes problematic when the optimal action induced by the approximated…
External link:
http://arxiv.org/abs/2203.11677
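The action-gap mechanism the abstract refers to is the advantage-learning operator from prior work (a sketch in generic notation; α ∈ [0,1) is the gap-increasing coefficient and T the Bellman optimality operator, symbols not taken from this entry):

    \mathcal{T}_{\mathrm{AL}}\,Q(s,a) = \mathcal{T}Q(s,a) - \alpha\big[\max_{b} Q(s,b) - Q(s,a)\big], \qquad \mathcal{T}Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\big[\max_{b} Q(s',b)\big].

Subtracting the scaled gap penalizes non-maximizing actions, widening the margin between the greedy action and its competitors; the failure mode the abstract raises arises when the greedy action of the approximate Q is not the truly optimal one.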
Advantage learning (AL) aims to improve the robustness of value-based reinforcement learning against estimation errors with action-gap-based regularization. Unfortunately, the method tends to be unstable in the case of function approximation. In this…
External link:
http://arxiv.org/abs/2203.10445
Learning complicated value functions in high-dimensional state spaces by function approximation is a challenging task, partially because the max-operator used in temporal difference updates can theoretically cause instability for most linear or nonlinear…
External link:
http://arxiv.org/abs/2012.09456
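The max-operator in question is the one in the standard Q-learning / temporal-difference target (generic notation, with η the step size; not this entry's proposed update):

    Q(s,a) \leftarrow Q(s,a) + \eta\,\big[r + \gamma \max_{b} Q(s',b) - Q(s,a)\big].

Combined with bootstrapping, off-policy data, and function approximation, this hard max is a known ingredient of the "deadly triad" instability, which is why softened alternatives to the max are often studied.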
Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO relies heavily…
External link:
http://arxiv.org/abs/1901.10314
Published in:
Pattern Recognition, vol. 131, November 2022