Use this URL to cite or link to this record in EThOS:
Title: Search and retrieval in massive data collections
Author: Albornoz, Pedro
ISNI:       0000 0004 2703 1659
Awarding Body: University of London
Current Institution: Royal Holloway, University of London
Date of Award: 2010
Availability of Full Text:
Access from EThOS:
Access from Institution:
The main goal of this research is to produce a novel and efficient searching application by means of best match and proximity searching with particular application to very large numeric and textual data stores. In today's world a huge amount of information is produced. Almost every part of our society is touched by systems that collect, store and analyse data. As an example I mention the case of scientific instrumentation: new sensors capture massive amounts of information (e.g. new telescopes acquiring data from different regions of the spectrum). Description of biological and chemical interactions also produce complex and large amounts of data. It is in this context that a big challenge for current analysis algorithms is presented. Many of the traditional methods for data analysis do not scale well in massive data sets nor in very high dimensional spaces. In this work I introduce a novel (ultrametric) distance called Baire based on the longest common prefix and show how it can be used to produce clusters through grouping data in 'bins' taking linear or O(n) computational time. Furthermore, it follows that this distance can be strictly fitted to a hierarchy tree. This is a property that proves very useful for classifying, storing, accessing and retrieving information. I go further to apply this methodology on data from different scientific areas such as astronomy and chemistry to create groups or clusters. Additionally I apply this method to document sets for clustering and retrieval. In particular, I look into the new area of enterprise search to propose a new method to support scalable search and clustering.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: search ; searching ; retrieval ; data collections