Use this URL to cite or link to this record in EThOS:
Title: Reinforcement learning with limited prior knowledge in long-term environments
Author: Bossens, David
ISNI: 0000 0004 9352 7388
Awarding Body: University of Southampton
Current Institution: University of Southampton
Date of Award: 2020
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS.
Access from Institution:
Increasingly, artificial learning systems are expected to overcome complex and open-ended problems in long-term environments, where there is limited knowledge about the task to solve, the learners receive limited observations and sparse feedback, the designer has no control over the environment, and unknown tasks may be presented to the learner at random times. These features are still challenging for reinforcement learning systems, because the best learning algorithm and the best hyperparameters are not known a priori. Deep reinforcement learning methods are recommended but are limited in the number of patterns they can learn and memorise. To overcome this capacity issue, this thesis investigates long-term adaptivity to improve and analyse reinforcement learning in long-term unknown environments. A first case study in non-episodic mazes with sparse rewards illustrates a novel learning type called active adaptive perception, which actively adapts how to use and modify perception based on a long-term utility function. Such learning systems are shown here to construct emergent long-term strategies to avoid detracting corridors and rooms in non-episodic mazes, where a state-of-the-art deep reinforcement learning system, DRQN, gets stuck. A subsequent case study addresses lifelong learning, where reinforcement learners must solve different tasks presented in sequence. It is shown that multiple policies, each specialised on a subset of the tasks, can be used as a source of performance improvement as well as a metric for task capacity, that is, how many tasks a single learner can learn and remember. The case study demonstrates that the DRQN learner has low task capacity compared to an alternative deep reinforcement learning system, PPO. The results indicate that this is because PPO’s slower learning allows improved long-term adaptation to different tasks.
An additional finding is that adaptively learning which policy to use can be beneficial if the policies are sufficiently different from each other. In the same case study, a further result shows that, when a long-term utility function is used to evaluate performance, correcting for the different reward functions helps to avoid forgetting.
Supervisor: Sobey, Adam James; Townsend, Nicholas
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available