Budgeted Classification-based Policy Iteration

Autor:	Gabillon, Victor
Přispěvatelé:	Laboratoire d'Informatique Fondamentale de Lille (LIFL), Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS), Sequential Learning (SEQUEL), Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Laboratoire d'Automatique, Génie Informatique et Signal (LAGIS), Université de Lille, Sciences et Technologies-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria), Universite Lille 1, Mohammad Ghavamzadeh, Philippe Preux
Jazyk:	angličtina
Rok vydání:	2014
Předmět:	algorithm bandit games adaptive sampling jeux de bandits decision-making artificial intelligence prise de décision (statistique) machine learning intelligence artificielle [STAT.ML]Statistics [stat]/Machine Learning [stat.ML] apprentissage automatique apprentissage par renforcement échantillonnage adaptatif (statistique) reinforcement learning algorithmes
Zdroj:	Machine Learning [stat.ML]. Universite Lille 1, 2014. English
Popis:	This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms, called classification-based policy iteration (CBPI). Contrary to the standard RL methods, CBPI do not use an explicit representation for value function. Instead, they use rollouts and estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm, is no longer defined by a (approximated) value function, but instead by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements are based on the following two shortcomings of the existing CBPI algorithms: 1) The rollouts that are used to estimate the action-value functions should be truncated and their number is limited, and thus, we have to deal with bias-variance tradeoff in estimating the rollouts, and 2) The rollouts are allocated uniformly over the states in the rollout set and the available actions, while a smarter allocation strategy could guarantee a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) the use of a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs.; Cette thèse étudie une classe d’algorithmesd’apprentissage par renforcement (RL), appelée « itération sur les politiques obtenuespar classification » (CBPI). Contrairement aux méthodes standards de RL, CBPIn’utilise pas de représentation explicite de la fonction valeur. CBPI réalise des déroulés(des trajectoires) et estime la fonction action-valeur de la politique courante pour unnombre limité d’états et d’actions. En utilisant un ensemble d’apprentissage con-struit à partir de ces estimations, la politique gloutonne est apprise comme le produitd’un classificateur. La politique ainsi produite à chaque itération de l’algorithme,n’est plus définie par une fonction valeur (approximée), mais par un classificateur.Dans cette thèse, nous proposons de nouveaux algorithmes qui améliorent les perfor-mances des méthodes CBPI existantes, spécialement lorsque le nombre d’interactionsavec l’environnement est limité. Nos améliorations se portent sur les deux limitationsde CBPI suivantes : 1) les déroulés utilisés pour estimer les fonctions action-valeurdoivent être tronqués et leur nombre est limité, créant un compromis entre le biais etla variance dans ces estimations, et 2) les déroulés sont répartis de manière uniformeentre les états déroulés et les actions disponibles, alors qu’une stratégie plus évoluéepourrait garantir un ensemble d’apprentissage plus précis. Nous proposons des algo-rithmes CBPI qui répondent à ces limitations, respectivement : 1) en utilisant uneapproximation de la fonction valeur pour améliorer la précision (en équilibrant biais etvariance) des estimations, et 2) en échantillonnant de manière adaptative les déroulésparmi les paires d’état-action.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=od_______212::a66701674303cb95a4b41c2d4f60ce4e https://tel.archives-ouvertes.fr/tel-01297386/document Zobrazit plný text záznamu