Showing 1 - 10 of 26 results for search: '"Zhang, Shenao"'
Author:
Zhang, Shenao, Yu, Donghan, Sharma, Hiteshi, Yang, Ziyi, Wang, Shuohang, Hassan, Hany, Wang, Zhaoran
Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) with human intentions. Unlike offline alignment with a fixed dataset, …
External link:
http://arxiv.org/abs/2405.19332
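The snippet above is truncated before it describes the paper's method, but the objective this line of work builds on, pairwise preference optimization against a frozen reference model, is standard. A minimal DPO-style sketch in PyTorch (the tensor names and the beta default are illustrative assumptions, not this paper's implementation):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style pairwise preference loss (illustrative, not the paper's method).

    Each input is a 1-D tensor of summed per-token log-probabilities of the
    chosen / rejected response under the policy or the frozen reference model.
    """
    # Log-ratios measure how far the policy has moved from the reference.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry model: the chosen response should win the comparison.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

In the online setting the abstract contrasts with, the (chosen, rejected) pair would come from freshly sampled model responses rather than a fixed offline dataset.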
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Author:
Liu, Zhihan, Lu, Miao, Zhang, Shenao, Liu, Boyi, Guo, Hongyi, Yang, Yingxiang, Blanchet, Jose, Wang, Zhaoran
Aligning generative models with human preferences via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a principled …
External link:
http://arxiv.org/abs/2405.16436
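The title states the paper's central observation: keeping the SFT objective on preferred responses acts as a regularizer against reward overoptimization. A hedged sketch of one way such a combined loss can look (the eta weight and tensor names are assumptions for illustration, reusing the conventions of the sketch above):

```python
import torch.nn.functional as F

def preference_loss_with_sft(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps,
                             beta=0.1, eta=1.0):
    """Preference loss plus an SFT term (illustrative combination).

    The SFT term anchors the policy to the preferred data, which is the
    regularization effect the title refers to.
    """
    margin = (policy_chosen_logps - ref_chosen_logps) \
             - (policy_rejected_logps - ref_rejected_logps)
    preference_loss = -F.logsigmoid(beta * margin).mean()
    # SFT / maximum-likelihood term on the chosen responses.
    sft_loss = -policy_chosen_logps.mean()
    # eta trades off preference fitting against staying near the data
    # distribution; its value here is an assumed hyperparameter.
    return preference_loss + eta * sft_loss
```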
Author:
Zhang, Shenao, Zheng, Sirui, Ke, Shuqi, Liu, Zhihan, Jin, Wanxin, Yuan, Jianbo, Yang, Yingxiang, Yang, Hongxia, Wang, Zhaoran
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful …
External link:
http://arxiv.org/abs/2402.16181
ReParameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics. However, recent studies have revealed that, when applied to long-term reinforcement learning problems, …
External link:
http://arxiv.org/abs/2310.19927
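For context on what a reparameterization policy gradient is: the action sample is written as a deterministic, differentiable function of the policy parameters and exogenous noise, so the return of a differentiable rollout can be backpropagated end to end. A minimal sketch, assuming `policy`, `dynamics`, and `reward` are differentiable callables (all hypothetical interfaces, not the cited paper's algorithm):

```python
import torch

def rp_gradient_objective(policy, dynamics, reward, s0, horizon=10):
    """Negative return of a differentiable rollout (illustrative RP-PGM sketch)."""
    s, total_return = s0, 0.0
    for _ in range(horizon):
        mu, log_std = policy(s)
        # Reparameterization trick: sample = mu + sigma * eps with
        # eps ~ N(0, I), so gradients flow through mu and log_std.
        a = mu + log_std.exp() * torch.randn_like(mu)
        total_return = total_return + reward(s, a)
        s = dynamics(s, a)
    # Minimizing this backpropagates through every step of the rollout,
    # which is where long-horizon gradient pathologies can arise.
    return -total_return.mean()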
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it remains unclear how to complete a given task provably within a minimum number of …
External link:
http://arxiv.org/abs/2309.17382
Author:
Liu, Zhihan, Lu, Miao, Xiong, Wei, Zhong, Han, Hu, Hao, Zhang, Shenao, Zheng, Sirui, Yang, Zhuoran, Wang, Zhaoran
In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components …
External link:
http://arxiv.org/abs/2305.18258
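The snippet cuts off mid-list; in this literature the components are typically (value or model) estimation, planning, and exploration, the last often via an optimism bonus. As a self-contained illustration of the estimation-plus-exploration pattern (the environment interface is assumed, and this is not the cited paper's algorithm), consider tabular Q-learning with a count-based bonus:

```python
import numpy as np

def q_learning_with_bonus(env, n_states, n_actions, episodes=500,
                          gamma=0.99, alpha=0.1, c=1.0):
    """Tabular Q-learning with a UCB-style count bonus (illustrative).

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    counts = np.ones((n_states, n_actions))  # init 1 to avoid divide-by-zero
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Exploration: act greedily on Q plus an optimism bonus that
            # shrinks as (s, a) is visited more often.
            a = int(np.argmax(Q[s] + c / np.sqrt(counts[s])))
            s2, r, done = env.step(a)
            counts[s, a] += 1
            # Estimation: one-step TD update toward the bootstrapped target.
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```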
With strong capabilities of reasoning and a broad understanding of the world, Large Language Models (LLMs) have demonstrated immense potential in building versatile embodied decision-making agents capable of executing a wide array of tasks. Nevertheless, …
External link:
http://arxiv.org/abs/2305.15695
Author:
Zhang, Shenao
Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing a complexity measure of the model. However, the complexity might grow …
External link:
http://arxiv.org/abs/2209.07676
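Since the record names posterior sampling (PSRL) explicitly, a compact tabular version helps fix ideas: sample one MDP from the current posterior, plan in it, act greedily, and update the posterior with the observed transitions. The Dirichlet prior and the environment interface below are standard-but-assumed details, not this paper's construction:

```python
import numpy as np

def psrl_tabular(env, n_states, n_actions, horizon, episodes=100):
    """Posterior Sampling for RL on a tabular MDP (illustrative sketch).

    Assumes env.reset() -> state and env.step(a) -> (state, reward, done).
    """
    # Dirichlet(1, ..., 1) prior over next-state distributions per (s, a),
    # plus a running mean estimate of rewards.
    trans_counts = np.ones((n_states, n_actions, n_states))
    r_sum = np.zeros((n_states, n_actions))
    r_cnt = np.ones((n_states, n_actions))
    for _ in range(episodes):
        # 1) Sample one plausible MDP from the posterior.
        P = np.array([[np.random.dirichlet(trans_counts[s, a])
                       for a in range(n_actions)] for s in range(n_states)])
        R = r_sum / r_cnt
        # 2) Plan: finite-horizon value iteration in the sampled MDP.
        Q = np.zeros((horizon + 1, n_states, n_actions))
        for h in range(horizon - 1, -1, -1):
            V = Q[h + 1].max(axis=1)
            Q[h] = R + P @ V
        # 3) Act as if the sampled MDP were true; update the posterior.
        s = env.reset()
        for h in range(horizon):
            a = int(np.argmax(Q[h, s]))
            s2, r, done = env.step(a)
            trans_counts[s, a, s2] += 1
            r_sum[s, a] += r
            r_cnt[s, a] += 1
            s = s2
            if done:
                break
    return trans_counts, r_sum / r_cnt
```

Randomizing over models in step 1 is what drives exploration here, in place of the explicit optimism bonus of the sketch above.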
In multi-agent reinforcement learning, the behaviors that agents learn in a single Markov Game (MG) are typically confined to the given agent number. Every single MG induced by varying the population may possess distinct optimal joint strategies and …
External link:
http://arxiv.org/abs/2108.12988
Published in:
Heliyon, 30 June 2024, 10(12)