Title: Statistical learning for dynamic bandits
Author: Lu, Xue
ISNI:       0000 0004 7963 761X
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2019
Abstract: The Multi-armed Bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total reward of a gambler/player who sequentially pulls arms of a multi-armed slot machine, where each arm is associated with a reward distribution. In this thesis, we focus on an extension, the dynamic bandit problem, which is a more realistic setting for real applications. Breaking the dynamic bandit problem down into two sub-problems, estimation and selection, we extend the popular selection mechanisms epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling (TS) to the dynamic setting by improving their estimates of the expected reward. In this thesis, we use two approaches to improve the estimation: a model-based approach and a data-driven approach.

For the model-based approach, we formulate the dynamic bandit problem via a state-space model and solve it by deploying well-studied methods such as the Kalman Filter (KF) or Sequential Monte Carlo (SMC), in conjunction with the standard selection mechanisms. The novelty of our algorithms is to integrate within the bandit problem a real-time estimation of the static parameters of the state-space model. This is useful for real applications, where it is hard (or costly) to specify the parameters a priori. These algorithms are generic and can be applied to different reward distributions, e.g., Gaussian and Bernoulli.

For the data-driven approach, we focus mostly on Bernoulli rewards, use an adaptive estimation technique based on Adaptive Forgetting Factors (AFFs) to estimate the expected reward, and implement selection with the popular selection mechanisms. Our algorithms are easy to implement and quite robust to the choice of tuning parameters. We also extend these AFF-based algorithms to the contextual bandit problem.
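As a minimal illustration of the estimation-plus-selection decomposition described above, the sketch below pairs an exponentially weighted reward estimate with epsilon-greedy selection. A fixed forgetting factor `ff` stands in for the thesis's adaptive forgetting factors, so this is a simplified sketch, not the thesis's algorithm; all names and defaults here are illustrative.

```python
import random

class ForgettingFactorBandit:
    """Simplified sketch: exponentially weighted (forgetting-factor)
    reward estimates combined with epsilon-greedy selection.
    A fixed forgetting factor replaces the adaptive one used in the thesis."""

    def __init__(self, n_arms, ff=0.95, eps=0.1):
        self.ff = ff             # forgetting factor in (0, 1]; 1 = ordinary mean
        self.eps = eps           # exploration probability
        self.w = [0.0] * n_arms  # discounted pull counts per arm
        self.m = [0.0] * n_arms  # discounted reward sums per arm

    def estimate(self, arm):
        # Weighted mean of past rewards; optimistic 1.0 for unpulled arms
        # so that every arm gets tried at least once.
        return self.m[arm] / self.w[arm] if self.w[arm] > 0 else 1.0

    def select(self):
        # Explore uniformly with probability eps, otherwise exploit.
        if random.random() < self.eps:
            return random.randrange(len(self.w))
        return max(range(len(self.w)), key=self.estimate)

    def update(self, arm, reward):
        # Discount the pulled arm's history, then add the new observation,
        # so older rewards decay geometrically and the estimate can track change.
        self.w[arm] = self.ff * self.w[arm] + 1.0
        self.m[arm] = self.ff * self.m[arm] + reward
```

With `ff < 1` the estimate down-weights old observations geometrically, which is what lets the player track a reward distribution that drifts over time; the adaptive scheme in the thesis tunes this discounting online rather than fixing it in advance.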
Supervisor: Kantas, Nikolas ; Adams, Niall
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral