Title: Bayesian sampling in contextual-bandit problems with extensions to unknown normal-form games
Author: May, Benedict C.
Awarding Body: University of Bristol
Current Institution: University of Bristol
Date of Award: 2013
In sequential decision problems in unknown environments, decision makers often face a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. In this thesis, we address this exploration/exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems, and also multi-agent (game-theoretic) problems. We consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. Our initial focus is on problems with a single decision maker acting. We extend the approach of Thompson (1933) by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). The problem has recently resurfaced in the context of contextual bandits for maximising revenue in sponsored web search advertising. We implement OBS and test its performance in several simulated domains. We find that it performs consistently better than numerous competitor methods.
Our second focus is extending the method of Thompson (1933) to problems with more than one decision maker acting, where individual rewards depend on the actions of others. Each agent must predict the actions of others to maximise reward. We consider combining Thompson sampling with fictitious play and establish conditions under which agents' strategies converge to best responses to the empirical frequencies of opponent play, and also under which the belief process is a generalised weakened fictitious play process of Leslie and Collins (2006).
Fictitious play is a deterministic algorithm, and so is not entirely consistent with the philosophy of Thompson sampling. We consider combining Thompson sampling with a randomised version of fictitious play that guarantees players play best responses to the empirical frequencies of opponent play. We also consider how the LTS and OBS algorithms can be extended to team games, where all agents receive the same reward. We suggest a novel method of achieving 'perfect coordination', in the sense that the multi-agent problem is effectively reduced to a single-agent problem.
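The single-agent sampling idea in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a plain (non-contextual) Bernoulli bandit with Beta(1, 1) priors, and it assumes that the optimistic variant scores each action by the larger of its posterior sample and its posterior mean, which is one way to make uncertain actions more likely to be played. The function names (`choose_arm`, `run`) and the reward probabilities are hypothetical and chosen for illustration only.

```python
import random


def posterior_means(successes, failures):
    # Mean of a Beta(1 + s, 1 + f) posterior for each arm.
    return [(1 + s) / (2 + s + f) for s, f in zip(successes, failures)]


def posterior_samples(successes, failures, rng):
    # One draw per arm from its Beta(1 + s, 1 + f) posterior.
    return [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]


def choose_arm(successes, failures, rng, optimistic=False):
    # Thompson-style selection: play the arm with the largest sampled value.
    scores = posterior_samples(successes, failures, rng)
    if optimistic:
        # Assumed optimistic rule: never score an arm below its posterior mean.
        means = posterior_means(successes, failures)
        scores = [max(x, m) for x, m in zip(scores, means)]
    return max(range(len(scores)), key=scores.__getitem__)


def run(true_probs, steps, optimistic=False, seed=0):
    # Simulate a Bernoulli bandit; true_probs are hypothetical reward rates.
    rng = random.Random(seed)
    k = len(true_probs)
    successes, failures = [0] * k, [0] * k
    total = 0
    for _ in range(steps):
        arm = choose_arm(successes, failures, rng, optimistic)
        reward = 1 if rng.random() < true_probs[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total += reward
    return total, successes, failures


if __name__ == "__main__":
    for flag in (False, True):
        total, s, f = run([0.3, 0.5, 0.7], steps=1000, optimistic=flag)
        pulls = [a + b for a, b in zip(s, f)]
        print("optimistic" if flag else "plain", total, pulls)
```

Running the script prints the total reward and the number of pulls per arm for each variant. A single seeded run like this says nothing about which variant is better in general; the comparison reported in the abstract rests on the thesis's own experiments.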
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available