Use this URL to cite or link to this record in EThOS:
Title: Efficient graph construction for similarity search on high dimensional data
Author: Kanthan, Leslie
ISNI: 0000 0004 8500 320X
Awarding Body: UCL (University College London)
Current Institution: University College London (University of London)
Date of Award: 2019
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
The K nearest neighbours graph, denoted KNNG, is an essential graph in data mining and machine learning. However, despite its significance, exact construction of this graph for high dimensional datasets (d > 10) is inefficient, with O(n^2) computational complexity. Approximate algorithms have been shown to improve upon this complexity, but compromise accuracy. In this thesis, we focus on automatically improving existing locality sensitive hashing (LSH) schemes and on proposing new schemes that find good trade-offs between accuracy and speed. We investigate how to obtain an LSH version with a guaranteed worst-case subquadratic cost that minimises the loss of accuracy. We implement such an algorithm and evaluate its runtime impact for different types of datasets. We also implement the most popular LSH versions, perform a detailed experimental comparison, and present trends relating specific LSH versions to the characteristics of the input dataset. Relying on the findings of this analysis, we propose Variable Radius LSH (VRLSH), a new LSH scheme that is suitable for distributed computation and capable of handling large datasets. We show how VRLSH can scale efficiently with the size of the dataset, and how it can improve the accuracy of the generated KNNG. Next, we propose three new LSH schemes that rely on the strategy of imitating biological systems. In particular, we propose RFLY, PFLY and DPFLY, three schemes inspired by FLY-LSH, a recent variation of the LSH algorithm modelled on the olfactory circuit of flies, which is used to identify similar odours. We first experiment with FLY-LSH and extend its evaluation by running it on a larger number of datasets. The three proposed algorithms improve both the accuracy and the applicability of FLY-LSH on real datasets. First, RFLY improves the accuracy of the generated graph by 10%. Second, PFLY distributes data more appropriately in a pre-fixed number of buckets, while concurrently improving the accuracy of the generated graph. Third, DPFLY adapts random projections to the input dataset, achieving a 15% improvement. Furthermore, we propose a novel optimisation framework that uses machine learning techniques and genetic algorithms to automatically select a Pareto-frontier-tuned version of the LSH schemes for a given input dataset. In our experiments, our optimisation framework improves the performance of every version of the LSH algorithm, by 10% in speed and 13% in accuracy. Last, we discuss future work and how the findings of this thesis can further help the research community.
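The trade-off described in the abstract, exact O(n^2) construction against faster but approximate hashing, can be illustrated with a minimal sketch. This is not the thesis's VRLSH or any of its FLY-LSH variants: it assumes a generic random-hyperplane (sign of random projection) LSH, and the function names (sq_dist, exact_knng, lsh_knng, recall) and the parameters n_bits, n_tables and seed are hypothetical choices made for illustration. The hyperplanes pass through the origin, so the sketch also assumes roughly centred data.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knng(points, k):
    """Brute-force KNNG: every point is compared with every other, O(n^2)."""
    graph = []
    for i, p in enumerate(points):
        dists = sorted((sq_dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph.append([j for _, j in dists[:k]])
    return graph


def lsh_knng(points, k, n_bits=4, n_tables=8, seed=0):
    """Approximate KNNG using random-hyperplane LSH (an illustrative assumption).

    Each table hashes a point to the sign pattern of n_bits random projections.
    Only points that share a bucket in at least one table are compared exactly.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    candidates = [set() for _ in points]
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            key = tuple(sum(w * x for w, x in zip(plane, p)) >= 0 for plane in planes)
            buckets[key].append(i)
        for members in buckets.values():
            for i in members:
                candidates[i].update(members)
    graph = []
    for i, cand in enumerate(candidates):
        cand.discard(i)  # a point is not its own neighbour
        ranked = sorted(cand, key=lambda j: (sq_dist(points[i], points[j]), j))
        graph.append(ranked[:k])
    return graph


def recall(approx, exact):
    """Fraction of true neighbour edges that the approximate graph recovered."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx, exact))
    return hits / sum(len(e) for e in exact)


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
    exact = exact_knng(data, k=5)
    approx = lsh_knng(data, k=5)
    print("recall of the LSH graph:", recall(approx, exact))
```

In this sketch, raising n_bits makes buckets smaller (fewer exact comparisons, lower recall), while raising n_tables gives each point more chances to meet its true neighbours (more comparisons, higher recall). This is the kind of accuracy and speed trade-off that the abstract says the thesis tunes automatically.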
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available