METELLI ALBERTO MARIA
Cycle: XXXIII
Section: Computer Science and Engineering
Tutor: GATTI NICOLA
Advisor: RESTELLI MARCELLO
Major Research Topic: Reinforcement Learning in Configurable Markov Decision Processes

Abstract:
Markov Decision Processes (MDPs) are a popular formalism for modeling sequential decision-making problems. Solving an MDP means finding a policy, i.e., a prescription of actions, that maximizes a given utility function. In the classical Reinforcement Learning (RL) framework, the MDP parameters are assumed to be fixed, unknown, and beyond the control of the agent. However, there exist several real-world scenarios in which the environment is partially controllable and, therefore, it might be beneficial to configure some of its features. For instance, a human car driver has at her/his disposal a number of possible vehicle configurations she/he can act on (e.g., seasonal tires, stability and vehicle attitude, engine model, automatic speed control, parking aid system) to improve the driving style or to speed up the process of learning a good driving policy.
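For reference, in the standard textbook formulation (assumed here, not spelled out in this abstract), an MDP is a tuple M = (S, A, P, r, gamma, mu) with state space S, action space A, transition model P(s'|s,a), reward function r, discount factor gamma, and initial-state distribution mu; solving it amounts to maximizing the expected discounted return:

\[
  \pi^{*} \in \operatorname*{arg\,max}_{\pi} \; J(\pi)
  = \mathop{\mathbb{E}}_{s_0 \sim \mu,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}
    \left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right].
\]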
In this research, we introduce a novel framework to model Configurable Markov Decision Processes (Conf-MDPs), i.e., MDPs that admit the possibility of altering some environmental parameters to a limited extent. At an intuitive level, there exists a tight connection between the environment, the policy, and the learning process. First, in some contexts, the agent is allowed to select the task to solve within a given set. In this case, it is beneficial to configure the environment in order to identify the MDP that maximizes the performance of the optimal policy. Second, even when the task is fixed, it might be convenient to dynamically change the environment (e.g., the reward function or the discount factor) in order to ease the learning process, speeding up convergence to the optimal policy. For both cases, we will start from the formulation of RL in the Conf-MDP framework, derive learning algorithms capable of taking advantage of the environment configurability, and study their theoretical properties. Then, we will compare our algorithms with state-of-the-art methods, especially reward shaping and intrinsically motivated learning, in relevant applications such as vehicle configuration and teaching planning.
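One tentative way to make the first setting precise (a sketch consistent with the intuition above, using the MDP notation introduced earlier; the configuration set is our assumption, not notation fixed in this abstract) is to let the transition model range over a set of admissible configurations and optimize policy and environment jointly:

\[
  (\pi^{*}, P^{*}) \in \operatorname*{arg\,max}_{\pi \in \Pi,\; P \in \mathcal{P}} \; J(\pi, P)
  = \mathop{\mathbb{E}}_{s_0 \sim \mu,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}
    \left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right],
\]

where the set of admissible transition models \(\mathcal{P}\) encodes the "limited extent" to which the environment can be altered.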