Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.546973
Title: Kernels for protein homology detection
Author: Spalding, John Dylan
Awarding Body: University of Exeter
Current Institution: University of Exeter
Date of Award: 2009
Availability of Full Text:
Access from EThOS:
Access from Institution:
Abstract:
Determining protein sequence similarity is an important task for protein classification and homology detection, which is typically performed using sequence alignment algorithms. Fast and accurate alignment-free kernel based classifiers exist, that treat protein sequences as a “bag of words”. Kernels implicitly map the sequences to a high dimensional feature space, and can be thought of as an inner product between two vectors in that space. This allows an algorithm that can be expressed purely in terms of inner products to be ‘kernelised’, where the algorithm implicitly operates in the kernel’s feature space. A weighted string kernel, where the weighting is derived using probabilistic methods, is implemented using a binary data representation, and the results reported. Alternative forms of data representation, such as Ising and frequency forms, are implemented and the results discussed. These results are then used to inform the development of a variety of novel kernels for protein sequence comparison. Alternative forms of classifier are investigated, such as nearest neighbour, support vector machines, and multiple kernel learning. A kernelized Gaussian classifier is derived and tested, which is informative as it returns a score related to the probability of a sequence belonging to a particular classification. Support vector machines are tested with the introduced kernels, and the results compared to alternate classifiers. As similarity can be thought of as having different components, such as composition and position, multiple kernel learning is investigated with the novel kernels developed here. The results show that a support vector machine, using either single or multiple kernels, is the best classifier for remote protein homology detection out of all the classifiers tested in this thesis.
Supervisor: Everson, Richard ; Hoyle, David Sponsor: EPSRC
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.546973  DOI: Not available
Keywords: Kernel methods ; Protein homology ; Support vecotr machine ; Classification
Share: