Showing 1 - 9
of 9
results for search: '"Bhandari, Jalaj"'
Author:
Zhu, Zheqing, Braz, Rodrigo de Salvo, Bhandari, Jalaj, Jiang, Daniel, Wan, Yi, Efroni, Yonathan, Wang, Liyuan, Xu, Ruiyang, Guo, Hongbo, Nikulkov, Alex, Korenkevych, Dmytro, Dogan, Urun, Cheng, Frank, Wu, Zheng, Xu, Wanqiao
Reinforcement Learning (RL) offers a versatile framework for achieving long-term goals. Its generality allows us to formalize a wide range of problems that real-world intelligent systems encounter, such as dealing with delayed rewards, handling parti…
External link:
http://arxiv.org/abs/2312.03814
Author:
Xu, Ruiyang, Bhandari, Jalaj, Korenkevych, Dmytro, Liu, Fan, He, Yuchen, Nikulkov, Alex, Zhu, Zheqing
Auction-based recommender systems are prevalent in online advertising platforms, but they are typically optimized to allocate recommendation slots based on immediate expected return metrics, neglecting the downstream effects of recommendations on use…
External link:
http://arxiv.org/abs/2305.13747
Author:
Bhandari, Jalaj
Reinforcement learning (RL) has attracted rapidly increasing interest in the machine learning and artificial intelligence communities in the past decade. With tremendous success already demonstrated for Game AI, RL offers great potential for applicat…
Author:
Bhandari, Jalaj, Russo, Daniel
We revisit the finite time analysis of policy gradient methods in one of the simplest settings: finite state and action MDPs with a policy class consisting of all stochastic policies and with exact gradient evaluations. There has been some recent…
External link:
http://arxiv.org/abs/2007.11120
Author:
Bhandari, Jalaj, Russo, Daniel
Policy gradient methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by standard dynamic programming t…
External link:
http://arxiv.org/abs/1906.01786
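The snippet above describes policy gradient methods as stochastic gradient descent over a parameterized policy class. As a purely illustrative sketch (not the paper's method or setting), here is a minimal REINFORCE-style update on a toy multi-armed bandit, assuming a softmax parameterization and a running-average baseline:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(means, steps=3000, lr=0.1, seed=0):
    """Stochastic gradient ascent on expected reward for a
    softmax-parameterized policy over a toy bandit (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(means))
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choice(len(means), p=probs)
        r = means[a] + rng.normal(0.0, 0.1)  # noisy reward from chosen arm
        # grad of log pi(a) under softmax: one-hot(a) - probs
        grad = -probs
        grad[a] += 1.0
        theta += lr * (r - baseline) * grad  # REINFORCE update with baseline
        baseline += 0.05 * (r - baseline)    # running-average baseline
    return softmax(theta)

# the learned policy should concentrate on the best arm (index 1 here)
probs = reinforce_bandit(np.array([0.2, 0.8, 0.5]))
```

The baseline does not change the gradient in expectation but reduces the variance of the update, which is one of the practical concerns such finite-time analyses address.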
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its t…
External link:
http://arxiv.org/abs/1806.02450
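The abstract above describes TD learning as an iterative algorithm for policy evaluation. A minimal tabular TD(0) sketch (illustrative only; the MDP arrays `P`, `R`, and `policy` are hypothetical placeholders, not from the paper):

```python
import numpy as np

def td0_evaluate(P, R, policy, gamma=0.9, alpha=0.1, steps=5000, seed=0):
    """Tabular TD(0) policy evaluation on a small MDP.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate reward
    policy[s, a]: action probabilities of the policy being evaluated
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    s = 0
    for _ in range(steps):
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        r = R[s, a]
        # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V
```

On a symmetric two-state MDP with reward 1 everywhere and gamma = 0.9, the fixed point is V = 1 / (1 - 0.9) = 10 in every state, which the iterates approach.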
Academic article
This result cannot be displayed to unauthenticated users; log in to view it.
Academic article
This result cannot be displayed to unauthenticated users; log in to view it.
Published in:
Operations Research Letters, September 2016, 44(5):612-617