METELLI ALBERTO MARIA | Cycle: XXXIII
Section: Computer Science and Engineering
Tutor: GATTI NICOLA
Advisor: RESTELLI MARCELLO

Major Research topic: Exploiting Environment Configurability in Reinforcement Learning

Abstract:
In recent decades, Reinforcement Learning (RL) has emerged as an effective approach to addressing complex control tasks. The formalism typically employed to model the sequential interaction between an artificial agent and its environment is the Markov Decision Process (MDP). In an MDP, the agent perceives the state of the environment and performs an action; as a consequence, the environment transitions to a new state and generates a reward signal. The goal of the agent is to learn a policy, i.e., a prescription of actions, that maximizes the long-term reward.
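The agent-environment interaction described above can be sketched in code. The following is a minimal illustrative example (a hypothetical two-state MDP, not taken from the dissertation): a tabular transition/reward model, a sampling step, and the rollout loop in which a fixed policy accumulates reward.

```python
import random

# Hypothetical finite MDP: P[(state, action)] -> list of
# (next_state, probability, reward) outcomes.
P = {
    ("s0", "a0"): [("s0", 0.7, 0.0), ("s1", 0.3, 1.0)],
    ("s0", "a1"): [("s1", 1.0, 0.5)],
    ("s1", "a0"): [("s0", 1.0, 0.0)],
    ("s1", "a1"): [("s1", 1.0, 1.0)],
}

def step(state, action, rng):
    """Sample the next state and reward from the transition model."""
    outcomes = P[(state, action)]
    r, acc = rng.random(), 0.0
    for nxt, prob, reward in outcomes:
        acc += prob
        if r <= acc:
            return nxt, reward
    return outcomes[-1][0], outcomes[-1][2]

def run_episode(policy, horizon=10, seed=0):
    """Roll out a deterministic policy and accumulate the undiscounted return."""
    rng = random.Random(seed)
    state, total = "s0", 0.0
    for _ in range(horizon):
        action = policy[state]                 # the policy prescribes an action
        state, reward = step(state, action, rng)
        total += reward                        # the environment emits a reward
    return total

# A policy that reaches and then stays in the rewarding state s1.
greedy = {"s0": "a1", "s1": "a1"}
```

With this policy the transitions taken are deterministic, so the episode return is the same on every run.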
In the traditional setting, the environment is assumed to be a fixed entity that cannot be altered externally. However, in several real-world scenarios the environment can be modified to a limited extent, and it might therefore be beneficial to act on some of its features. We call this activity environment configuration; it can be carried out by the agent itself or by an external entity, which we refer to as a configurator. Although environment configuration arises quite often in real applications, the topic has received little attention in the literature. In this dissertation, we aim at formalizing and studying the diverse aspects of environment configuration. The contributions are theoretical, algorithmic, and experimental, and can be broadly subdivided into three parts.
The first part of the dissertation introduces the novel formalism of Configurable Markov Decision Processes (Conf-MDPs) to model the configuration opportunities offered by the environment. At an intuitive level, there exists a tight connection between environment, policy, and learning process. We explore the different nuances of environment configuration, based on whether the configuration is fully auxiliary to the agent's learning process (cooperative setting) or is guided by a configurator whose objective possibly conflicts with the agent's (non-cooperative setting).
In the second part, we focus on the cooperative Conf-MDP setting and investigate the learning problem of finding an agent policy and an environment configuration that jointly optimize the long-term reward. We provide algorithms for solving both finite and continuous Conf-MDPs, and we conduct experimental evaluations on synthetic and realistic domains.
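To make the joint optimization problem concrete, here is an illustrative brute-force sketch on a hypothetical two-state Conf-MDP (this is not one of the dissertation's algorithms, which scale to much larger problems): a configurable parameter controls the transition probabilities, and we enumerate every (configuration, deterministic policy) pair, scoring each by exact discounted policy evaluation.

```python
import itertools

STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9

def model(config):
    """Hypothetical configurable environment: `config` is the success
    probability of the 'switch state' action. Reward 1 for being in state 1."""
    P, R = {}, {}
    for s in STATES:
        P[(s, 0)] = {s: 1.0}                           # action 0: stay put
        P[(s, 1)] = {1 - s: config, s: 1.0 - config}   # action 1: try to switch
        for a in ACTIONS:
            R[(s, a)] = 1.0 if s == 1 else 0.0
    return P, R

def evaluate(policy, config, iters=200):
    """Iterative policy evaluation of a deterministic policy; returns V(0)."""
    P, R = model(config)
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: R[(s, policy[s])]
                + GAMMA * sum(p * V[s2]
                              for s2, p in P[(s, policy[s])].items())
             for s in STATES}
    return V[0]

def joint_search(configs):
    """Enumerate (policy, configuration) pairs; return (value, policy, config)."""
    best = None
    for config in configs:
        for acts in itertools.product(ACTIONS, repeat=len(STATES)):
            policy = dict(zip(STATES, acts))
            v = evaluate(policy, config)
            if best is None or v > best[0]:
                best = (v, policy, config)
    return best
```

On the candidate configurations {0.3, 0.9}, the search prefers the higher switch-success probability together with the policy that moves to state 1 and then stays, illustrating that the optimal policy and the optimal configuration are selected jointly rather than separately.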
The third part addresses two specific applications of the Conf-MDP framework: policy space identification and control frequency adaptation. In the former, we employ environment configurability to induce the learning agent to reveal its perception and actuation capabilities. In the latter, we analyze how a specific configurable environmental parameter, the control frequency, affects the performance of batch RL algorithms.