Use this URL to cite or link to this record in EThOS: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.815925
Title: Reinforcement learning in persistent environments : representation learning and transfer
Author: Borsa, Diana
ISNI: 0000 0004 9359 0694
Awarding Body: UCL (University College London)
Current Institution: University College London (University of London)
Date of Award: 2020
Availability of Full Text: Full text unavailable from EThOS; access from the awarding institution.
Abstract:
Reinforcement learning (RL) provides a general framework for modelling and reasoning about agents capable of sequential decision making, with the goal of maximising a reward signal. In this work, we focus on the study of situated agents designed to learn autonomously through direct interaction with their environment, under limited or sparse feedback. We consider an agent in a persistent environment: the dynamics of this 'world' do not change over time, much like the laws of physics, and the agent needs to learn to master a potentially vast set of tasks within it. To tackle learning in multiple tasks efficiently, with the ultimate goal of scaling to a life-long learning agent, we turn our attention to transfer learning. The main insight behind this paradigm is that generalisation may occur not only within tasks, but also across them. The objective of transfer in RL is to accelerate learning by building and reusing knowledge obtained in previously encountered tasks. This knowledge can take the form of samples, value functions, policies, shared features or other abstractions of the environment or behaviour. In this thesis, we examine different ways of learning transferable representations for value functions.
We start by considering jointly learning value functions across multiple reward signals. We explore doing this by leveraging known multitask techniques to learn a shared set of features that cater to the intermediate solutions of popular iterative dynamic programming processes, like value and policy iteration. This learnt representation evolves as the individual value functions improve, and at the end of this process we obtain a shared basis for (near) optimal value functions. We show that this process benefits the learning of good policies for the tasks considered in this joint learning. This class of algorithms is potentially very general, but somewhat agnostic to the persistent-environment assumption. We therefore turn to ways of building this shared basis by leveraging more explicitly the rich structure induced by this assumption. This leads to various extensions of least-squares policy iteration methods to the multitask scenario, under shared dynamics. Here we leverage transfer of samples and multitask regression to further improve sample efficiency when building these shared representations, capturing commonalities across optimal value functions.
The second part of the thesis introduces a different way of representing knowledge, via successor features. In contrast to the representations learnt in the first part, these are policy-dependent and serve as a basis for policy evaluation, rather than directly building optimal value functions. As such, the way knowledge is transferred to a new task changes as well. We do this by first relating the new task to previously learnt ones: in particular, we try to approximate the new reward signal as a linear combination of previous ones. Under this approximation, we can obtain approximate evaluations of the quality of previously learnt policies on the new task. This enables us to carry over knowledge about good or bad behaviour across tasks and strictly improve on previous behaviours. Here the transfer leverages the structure in policy space, with the potential of re-using partial solutions learnt in previous tasks. We show empirically that this leads to a scalable, online algorithm that can successfully re-use the common structure, if present, between a set of training tasks and a new one.
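As a rough illustration of this transfer mechanism (a minimal sketch under assumed shapes and names, not code from the thesis), the idea can be expressed in two steps: fit a weight vector so that the new reward is approximated as a linear combination of known reward features, then use the successor features of previously learnt policies to evaluate those policies on the new task and act greedily over all of them (generalised policy improvement):

```python
import numpy as np

def fit_task_weights(phi_samples, reward_samples):
    """Approximate the new reward as a linear combination of reward features:
    r_new(s, a) ~ phi(s, a) . w, solved here by least squares.
    phi_samples: (n_samples, d) reward features; reward_samples: (n_samples,)."""
    w, *_ = np.linalg.lstsq(phi_samples, reward_samples, rcond=None)
    return w

def gpi_action(psi_s, w):
    """Generalised policy improvement at one state.
    psi_s: array of shape (n_policies, n_actions, d) holding the successor
    features of each previously learnt policy at the current state.
    Q^{pi_i}(s, a) = psi^{pi_i}(s, a) . w gives an approximate evaluation of
    each old policy on the new task; acting greedily over the maximum of these
    values reuses whichever old policy is best at this state."""
    q = psi_s @ w                         # (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))  # best action under the GPI policy
```

Acting on the maximum over the old policies' approximate values is what lets partial solutions from earlier tasks be reused: when the reward approximation is exact, the resulting behaviour is no worse than any individual previous policy on the new task.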
Finally, we show that if one has further knowledge about the reward structure an agent will encounter, one can leverage this to learn, very effectively and in an off-policy and off-task manner, a parameterised collection of successor features. These correspond to multiple (near) optimal policies for tasks hypothesised by the agent. This not only makes very efficient use of the data, but also proposes a parametric solution to the behaviour-basis problem, namely which policies one should learn to enable transfer.
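A minimal sketch of what such a parameterised collection might look like, assuming a simple linear parameterisation (the class and method names below are illustrative, not the thesis's implementation): a single function psi(s, a, z) is indexed by a task vector z, so one set of parameters covers successor features for a whole family of hypothesised tasks, and a new task can again be handled by generalised policy improvement over candidate z's.

```python
import numpy as np

class LinearUSFA:
    """Illustrative universal successor-feature approximator with a linear map
    from (state features, task vector) to per-action successor features."""

    def __init__(self, n_features, n_actions, d, rng=None):
        rng = rng or np.random.default_rng(0)
        # One weight tensor shared across all hypothesised tasks z.
        self.W = rng.normal(scale=0.01, size=(n_actions, d, n_features + d))

    def psi(self, state_feats, z):
        """Successor features psi(s, ., z) for all actions, shape (n_actions, d)."""
        x = np.concatenate([state_feats, z])
        return self.W @ x

    def gpi_action(self, state_feats, w_new, candidate_zs):
        """Evaluate the policies induced by several hypothesised tasks z on the
        new task w_new (Q = psi . w_new) and act greedily over all of them."""
        qs = [self.psi(state_feats, z) @ w_new for z in candidate_zs]
        return int(np.argmax(np.max(np.stack(qs), axis=0)))
```

Because the same parameters serve every task vector, data gathered while acting for one hypothesised task can be used off-policy and off-task to improve the successor features of all the others.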
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.815925
DOI: Not available