Showing 1 - 10 of 3,289 for search: '"A, Ghavamzadeh"'
Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, along with minimizing regret, also satisfy a safety constraint: its performance must not be worse than that of a baseline policy…
External link:
http://arxiv.org/abs/2412.06165
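The conservative-play idea behind CCBs can be sketched in a few lines: act with the learned policy only when a pessimistic estimate of cumulative reward stays above a (1 − α) fraction of the baseline's cumulative reward, and otherwise fall back to the baseline. This is an illustrative sketch of the generic conservative-bandit rule, not the algorithm of the linked paper; the function name and the exact budget form are assumptions.

```python
def conservative_choice(lcb_learned, baseline_reward,
                        cum_reward, cum_baseline, alpha=0.1):
    """Hypothetical conservative play rule (illustrative only).

    Play the learned arm only if a pessimistic (lower-confidence)
    estimate of total reward stays above a (1 - alpha) fraction of
    the baseline policy's cumulative reward.
    """
    pessimistic_total = cum_reward + lcb_learned            # worst case if we explore
    safety_budget = (1 - alpha) * (cum_baseline + baseline_reward)
    return pessimistic_total >= safety_budget               # True -> play learned arm
```

With a healthy pessimistic margin the learned arm is played; when exploring could violate the budget, the agent falls back to the baseline arm.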
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs…
External link:
http://arxiv.org/abs/2410.24128
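For reference, Value-at-Risk at level τ is simply the τ-quantile of the return distribution. An empirical version under the lower-quantile convention can be computed as below; this is the generic textbook definition, not the optimization method of the linked paper.

```python
import math

def value_at_risk(returns, tau):
    """Empirical Value-at-Risk: the smallest return r such that the
    empirical CDF satisfies F(r) >= tau (lower-quantile convention).
    Illustrative sketch only.
    """
    if not 0 < tau <= 1:
        raise ValueError("tau must lie in (0, 1]")
    ordered = sorted(returns)
    k = math.ceil(tau * len(ordered)) - 1   # index of the tau-quantile
    return ordered[max(k, 0)]
```

A risk-averse agent optimizing VaR at a small τ focuses on improving its worst returns rather than the mean.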
Author:
Kim, Kyuyoung, Jeong, Jongheon, An, Minyong, Ghavamzadeh, Mohammad, Dvijotham, Krishnamurthy, Shin, Jinwoo, Lee, Kimin
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent. However, excessive optimization with such reward models, which serve as mere proxy objectives…
External link:
http://arxiv.org/abs/2404.01863
The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies using historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is the distribution shift…
External link:
http://arxiv.org/abs/2310.18434
Author:
Biyik, Erdem, Yao, Fan, Chow, Yinlam, Haig, Alex, Hsu, Chih-wei, Ghavamzadeh, Mohammad, Boutilier, Craig
Preference elicitation plays a central role in interactive recommender systems. Most preference elicitation approaches use either item queries that ask users to select preferred items from a slate, or attribute queries that ask them to express their…
External link:
http://arxiv.org/abs/2311.02085
Author:
Jeong, Jihwan, Chow, Yinlam, Tennenholtz, Guy, Hsu, Chih-Wei, Tulepbergenov, Azamat, Ghavamzadeh, Mohammad, Boutilier, Craig
Recommender systems (RSs) play a central role in connecting users to content, products, and services, matching candidate items to users based on their preferences. While traditional RSs rely on implicit user feedback signals, conversational RSs…
External link:
http://arxiv.org/abs/2310.06176
Published in:
International Conference on Machine Learning, 2024
We study how to make decisions that minimize Bayesian regret in offline linear bandits. Prior work suggests that one must take actions with maximum lower confidence bound (LCB) on their reward. We argue that the reliance on LCB is inherently flawed…
External link:
http://arxiv.org/abs/2306.01237
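The LCB rule the abstract refers to is simple to state: among the candidate actions, pick the one whose estimated reward minus its confidence width is largest. A minimal sketch of that standard rule (the one the paper argues against relying on), with illustrative names:

```python
def lcb_action(estimates, widths):
    """Return the index of the action maximizing the lower
    confidence bound mu_hat_i - width_i. Illustrative sketch of the
    standard LCB rule discussed (and critiqued) in the abstract above.
    """
    lcbs = [mu - w for mu, w in zip(estimates, widths)]
    return max(range(len(lcbs)), key=lcbs.__getitem__)
```

Note how a nominally better arm can lose under this rule when its confidence interval is wide, which is the pessimism the rule encodes.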
Author:
Fan, Ying, Watkins, Olivia, Du, Yuqing, Liu, Hao, Ryu, Moonkyung, Boutilier, Craig, Abbeel, Pieter, Ghavamzadeh, Mohammad, Lee, Kangwook, Lee, Kimin
Learning from human feedback has been shown to improve text-to-image models. These techniques first learn a reward function that captures what humans care about in the task and then improve the models based on the learned reward function. Even though…
External link:
http://arxiv.org/abs/2305.16381
Author:
Bravo-Hermsdorff, Gecia, Busa-Fekete, Róbert, Ghavamzadeh, Mohammad, Medina, Andres Muñoz, Syed, Umar
Modern statistical estimation is often performed in a distributed setting where each sample belongs to a single user who shares their data with a central server. Users are typically concerned with preserving the privacy of their samples, and also…
External link:
http://arxiv.org/abs/2305.07751
Published in:
Advances in Neural Information Processing Systems (NeurIPS), 2023
Optimizing static risk-averse objectives in Markov decision processes is difficult because they do not admit the standard dynamic programming equations common in Reinforcement Learning (RL) algorithms. Dynamic programming decompositions that augment the…
External link:
http://arxiv.org/abs/2304.12477