Heuristic methods for support vector machines with applications to drug discovery
The contributions to computer science presented in this thesis were inspired by the analysis of the data generated in the early stages of drug discovery. These data sets are generated by screening compounds against various biological receptors. This gives a first indication of biological activity. To avoid screening inactive compounds, decision rules for selecting compounds are required. Such a decision rule is a mapping from a compound representation to an estimated activity. Hand-coding such rules is time-consuming, expensive and subjective. An alternative is to learn these rules from the available data. This is difficult since the compounds may be characterized by tens to thousands of physical, chemical, and structural descriptors and it is not known which are most relevant to the prediction of biological activity. Further, the activity measurements are noisy, so the data can be misleading. The support vector machine (SVM) is a statistically well-founded learning machine that is not adversely affected by high-dimensional representations and is robust with respect to measurement inaccuracies. It thus appears to be ideally suited to the analysis of screening data. The novel application of the SVM to this domain highlights some shortcomings with the vanilla SVM. Three heuristics are developed to overcome these deficiencies: a stopping criterion, HERMES, that allows good solutions to be found in less time; an automated method, LAIKA, for tuning the Gaussian kernel SVM; and, an algorithm, STAR, that outputs a more compact solution. These heuristics achieve their aims on public domain data and are broadly successful when applied to the drug discovery data. The heuristics and associated data analysis are thus of benefit to both pharmacology and computer science.