Showing 1 - 10 of 22 for search: '"ZHAO Heyang"'
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close…
External link:
http://arxiv.org/abs/2411.04625
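As a minimal illustration of the reverse-KL penalty this entry describes (a hypothetical tabular sketch, not the paper's algorithm; `beta` is an assumed regularization weight):

```python
import numpy as np

def reverse_kl(policy, ref_policy):
    """Reverse KL divergence KL(policy || ref_policy) for discrete distributions."""
    return float(np.sum(policy * np.log(policy / ref_policy)))

def regularized_objective(policy, rewards, ref_policy, beta):
    """Expected reward minus a reverse-KL penalty keeping policy near ref_policy."""
    return float(policy @ rewards) - beta * reverse_kl(policy, ref_policy)

def optimal_policy(rewards, ref_policy, beta):
    """Closed-form maximizer: a reward-tilted (softmax) reference policy."""
    w = ref_policy * np.exp(rewards / beta)
    return w / w.sum()
```

The closed form shows the effect of the penalty directly: as `beta` grows, the optimizer is pulled back toward the reference policy.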
Contextual dueling bandits, where a learner compares two options based on context and receives feedback indicating which was preferred, extends classic dueling bandits by incorporating contextual information for decision-making and preference learning…
External link:
http://arxiv.org/abs/2404.06013
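The pairwise feedback in this setting can be sketched with a Bradley-Terry (logistic) comparison model — an assumed setup for illustration, not the paper's specific construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def duel(theta, x_a, x_b):
    """Simulate one comparison: returns 1 if arm a is preferred over arm b.
    Preference probability follows a Bradley-Terry (logistic) model on the
    difference of linear utilities theta @ x."""
    p_a = 1.0 / (1.0 + np.exp(-(theta @ x_a - theta @ x_b)))
    return int(rng.random() < p_a)

# Hypothetical context-dependent features for two arms.
theta = np.array([1.0, -0.5])   # assumed true utility parameter
x_a = np.array([1.0, 0.0])
x_b = np.array([0.0, 1.0])
wins_a = sum(duel(theta, x_a, x_b) for _ in range(1000))
```

Here the learner only ever observes the binary outcome of each duel, never the utilities themselves, which is what distinguishes preference learning from reward-based bandits.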
Offline reinforcement learning (RL), where the agent aims to learn the optimal policy based on the data collected by a behavior policy, has attracted increasing attention in recent years. While offline RL with linear function approximation has been…
External link:
http://arxiv.org/abs/2310.01380
Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While…
External link:
http://arxiv.org/abs/2310.00968
Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate the regret for the worst-case regime and the deterministic…
External link:
http://arxiv.org/abs/2302.10371
We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping…
External link:
http://arxiv.org/abs/2212.06132
We study the problem of online generalized linear regression in the stochastic setting, where the label is generated from a generalized linear model with possibly unbounded additive noise. We provide a sharp analysis of the classical follow-the-regularized-leader…
External link:
http://arxiv.org/abs/2202.13603
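For the special case of squared loss with a quadratic regularizer, follow-the-regularized-leader reduces to online ridge regression with a closed-form minimizer; a minimal sketch under those assumptions (not the generalized-linear analysis of the paper):

```python
import numpy as np

class FTRLRegression:
    """Follow-the-regularized-leader for online linear regression.
    With squared loss and a quadratic regularizer lam * ||theta||^2 / 2,
    the FTRL iterate is the ridge-regression solution on all past data."""
    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)   # regularized Gram matrix
        self.b = np.zeros(dim)

    def predict(self, x):
        theta = np.linalg.solve(self.A, self.b)
        return float(theta @ x)

    def update(self, x, y):
        self.A += np.outer(x, x)
        self.b += y * x
```

Each round replays all accumulated losses plus the regularizer, but the sufficient statistics `(A, b)` make the minimization O(dim^3) per step rather than growing with time.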
We study the linear contextual bandit problem in the presence of adversarial corruption, where the interaction between the player and a possibly infinite decision set is contaminated by an adversary that can corrupt the reward up to a corruption level…
External link:
http://arxiv.org/abs/2110.12615
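For the uncorrupted linear contextual bandit, a standard LinUCB-style sketch illustrates the setting (corruption-robust variants, as studied in the entry above, reweight or cap these updates; `alpha` is an assumed exploration-bonus scale):

```python
import numpy as np

class LinUCB:
    """Standard LinUCB for linear contextual bandits (uncorrupted setting)."""
    def __init__(self, dim, lam=1.0, alpha=1.0):
        self.A = lam * np.eye(dim)   # regularized Gram matrix
        self.b = np.zeros(dim)
        self.alpha = alpha           # exploration bonus scale

    def choose(self, arms):
        """arms: list of feature vectors; returns the index maximizing the UCB
        (estimated reward plus an elliptical-confidence exploration bonus)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        ucb = [float(theta @ x) + self.alpha * float(np.sqrt(x @ A_inv @ x))
               for x in arms]
        return int(np.argmax(ucb))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

Because every observation enters `(A, b)` with equal weight, a single corrupted reward can shift the estimate; robust algorithms down-weight suspect rounds.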
Published in:
In Chinese Chemical Letters January 2021 32(1):243-257
Academic article
This result cannot be displayed to users who are not logged in.
To view the result, you must log in.