Showing 1 - 10 of 42 for search: '"Ye, Chenlu"'
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique for enhancing policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), forcing the learned policy to stay close …
External link:
http://arxiv.org/abs/2411.04625
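For context, a standard form of the reverse-KL-regularized objective used in RLHF (generic notation not taken from the paper: $d$ is the prompt distribution, $\pi_{\mathrm{ref}}$ the reference policy, $\beta>0$ the regularization strength) is

$$\max_{\pi}\; \mathbb{E}_{x\sim d}\Bigl[\, \mathbb{E}_{a\sim\pi(\cdot\mid x)}\bigl[r(x,a)\bigr] \;-\; \beta\,\mathrm{KL}\bigl(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\bigr)\Bigr],$$

where the KL term is "reverse" because the learned policy $\pi$ appears as its first argument; this is the mechanism that keeps $\pi$ close to $\pi_{\mathrm{ref}}$.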
This study tackles the challenges of adversarial corruption in model-based reinforcement learning (RL), where the transition dynamics can be corrupted by an adversary. Existing studies on corruption-robust RL mostly focus on the setting of model-free RL …
External link:
http://arxiv.org/abs/2402.08991
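A common generic way to formalize this setting (a sketch, not necessarily the paper's exact definition) lets the adversary replace the true transition kernel $P^*$ with a corrupted kernel $P_t$ in round $t$, and measures the total corruption as

$$\zeta \;=\; \sum_{t=1}^{T} \sup_{s,a}\, \bigl\|P_t(\cdot\mid s,a) - P^*(\cdot\mid s,a)\bigr\|_{1},$$

the aim being regret that degrades gracefully as $\zeta$ grows.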
We investigate Reinforcement Learning from Human Feedback (RLHF) in the context of a general preference oracle. In particular, we do not assume the existence of a reward function and an oracle preference signal drawn from the Bradley-Terry model …
External link:
http://arxiv.org/abs/2402.07314
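As background, the Bradley-Terry assumption that this work avoids ties pairwise preferences to a scalar reward $r$ via

$$\mathbb{P}\bigl(a \succ a' \mid x\bigr) \;=\; \sigma\bigl(r(x,a) - r(x,a')\bigr), \qquad \sigma(z)=\frac{1}{1+e^{-z}},$$

whereas a general preference oracle returns $\mathbb{P}(a \succ a' \mid x)$ directly, without positing any underlying reward function.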
Author:
Xiong, Wei, Dong, Hanze, Ye, Chenlu, Wang, Ziqi, Zhong, Han, Ji, Heng, Jiang, Nan, Zhang, Tong
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenges of existing popular methods like offline PPO and offline DPO as lacking in strategic exploration …
External link:
http://arxiv.org/abs/2312.11456
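For reference, the offline DPO baseline mentioned above fits the policy to a fixed preference dataset $\mathcal{D}=\{(x, y_w, y_l)\}$ of prompts with chosen and rejected responses by minimizing the standard DPO loss

$$\mathcal{L}_{\mathrm{DPO}}(\pi) \;=\; -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right];$$

because $\mathcal{D}$ is never refreshed, purely offline training of this kind cannot explore beyond its initial data, which is the gap the abstract points to.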
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. Specifically, each batch collects data according to a policy that depends on previous batches, and …
External link:
http://arxiv.org/abs/2311.13180
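One way to write the batched constraint sketched above (illustrative notation only): if $\{\mathcal{B}_1,\dots,\mathcal{B}_L\}$ partitions the $T$ rounds, then

$$\pi_t \;=\; \pi_t\bigl(\mathcal{D}_{\mathcal{B}_1\cup\cdots\cup\mathcal{B}_{b-1}}\bigr) \quad \text{for all } t\in\mathcal{B}_b,$$

i.e. the policy used during batch $\mathcal{B}_b$ may depend only on data gathered in earlier batches, with feedback revealed only once a batch is complete.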
Published in:
NeurIPS 2023
We investigate the problem of corruption robustness in offline reinforcement learning (RL) with general function approximation, where an adversary can corrupt each sample in the offline dataset, and the corruption level $\zeta\geq0$ quantifies the cumulative corruption …
External link:
http://arxiv.org/abs/2310.14550
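A typical way to instantiate the cumulative corruption level in the offline setting (a generic sketch rather than the paper's exact definition) is to sum per-sample perturbations of rewards and transitions over the $n$ datapoints,

$$\zeta \;=\; \sum_{i=1}^{n}\Bigl(\bigl|\widetilde r_i - r_i\bigr| \;+\; \bigl\|\widetilde P(\cdot\mid s_i,a_i) - P(\cdot\mid s_i,a_i)\bigr\|_{1}\Bigr),$$

where the tilded quantities are the corrupted versions actually recorded in the dataset.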
Modern deep learning heavily relies on large labeled datasets, which often come with high costs in terms of both manual labeling and computational resources. To mitigate these challenges, researchers have explored the use of informative subset selection …
External link:
http://arxiv.org/abs/2309.02476
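Abstractly, the subset-selection idea can be stated as a budgeted optimization (a generic formulation, not this paper's specific criterion): from a large pool $D$, pick a subset $S$ within a labeling budget $k$ that maximizes some informativeness measure $I$,

$$S^{\star} \;=\; \operatorname*{arg\,max}_{S\subseteq D,\ |S|\le k} I(S),$$

and train only on the labeled $S^{\star}$, trading a small drop in accuracy for much lower labeling and compute cost.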
Published in:
ICML 2023
Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the total number of rounds …
External link:
http://arxiv.org/abs/2212.05949
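To see why the multiplicative $\tilde{O}(\sqrt{T}\zeta)$ dependence is undesirable: the bound becomes linear in $T$ (hence vacuous) as soon as $\zeta=\Omega(\sqrt{T})$, whereas an additive dependence of the form

$$\mathrm{Regret}(T) \;=\; \tilde{O}\bigl(\sqrt{T} + \zeta\bigr)$$

recovers the uncorrupted $\tilde{O}(\sqrt{T})$ rate when $\zeta$ is small and degrades only linearly in the corruption budget; the additive form shown here is illustrative of the target in this literature, not a statement of this paper's exact bound.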
The remaining two results are academic articles whose full records are available only to logged-in users.