Use this URL to cite or link to this record in EThOS:
Title: The exploration-exploitation trade-off in sequential decision making problems
Author: Sykulski, Adam M.
ISNI:       0000 0004 2710 3341
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2011
Availability of Full Text:
Access from EThOS:
Access from Institution:
Sequential decision making problems require an agent to repeatedly choose between a series of actions. Common to such problems is the exploration-exploitation trade-off, where an agent must choose between the action expected to yield the best reward (exploitation) or trying an alternative action for potential future benefit (exploration). The main focus of this thesis is to understand in more detail the role this trade-off plays in various important sequential decision making problems, in terms of maximising finite-time reward. The most common and best studied abstraction of the exploration-exploitation trade-off is the classic multi-armed bandit problem. In this thesis we study several important extensions that are more suitable than the classic problem to real-world applications. These extensions include scenarios where the rewards for actions change over time or the presence of other agents must be repeatedly considered. In these contexts, the exploration-exploitation trade-off has a more complicated role in terms of maximising finite-time performance. For example, the amount of exploration required will constantly change in a dynamic decision problem, in multiagent problems agents can explore by communication, and in repeated games, the exploration-exploitation trade-off must be jointly considered with game theoretic reasoning. Existing techniques for balancing exploration-exploitation are focused on achieving desirable asymptotic behaviour and are in general only applicable to basic decision problems. The most flexible state-of-the-art approaches, έ-greedy and έ-first, require exploration parameters to be set a priori, the optimal values of which are highly dependent on the problem faced. To overcome this, we construct a novel algorithm, έ-ADAPT, which has no exploration parameters and can adapt exploration on-line for a wide range of problems. έ-ADAPT is built on newly proven theoretical properties of the έ-first policy and we demonstrate that έ-ADAPT can accurately learn not only how much to explore, but also when and which actions to explore.
Supervisor: Adams, Niall ; Jennings, Nick Sponsor: BAE Systems ; EPSRC
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral