Title: Online evolving clustering approaches to improving web search results
Author: Evans, Anthony D.
Awarding Body: Lancaster University
Current Institution: Lancaster University
Date of Award: 2011
A word's semantic interpretation may vary depending on the context in which it is used. A Web search for documents containing a specific keyword can therefore return a contextually disorganised set of results, which is undesirable for an end user who wants to find relevant information quickly. Existing search engines deliver results quickly but cannot take the contextual meaning of text into account, so query results are typically returned as an unordered list. Some searches, where the term used has a distinct meaning, will return only Web pages that fit that one meaning. However, many search terms have multiple meanings, and this makes locating the set of contextually relevant documents more difficult when a large number of hits are returned. In this study, document clustering is applied as a solution to help an end user find relevant information more efficiently from search results. To cluster documents, it is necessary to compare documents and group together those that are similar, based on a similarity distance measure involving words shared between those documents, so that contextual groupings can be inferred [23]. Uniquely in this study, Cosine distance was also applied in situations that normally used Euclidean measures [24]. Conventional clustering methods were found to be inadequate when applied to search engine results in this study. For example, an Internet search engine's results only provide pointers to documents and do not contain document vectors, so the required complete dataset is initially unavailable. Existing algorithms that cluster pre-fetched documents in an off-line mode (BD) are unsuitable. The clustering component needs to be online so that as each document vector is obtained, it can be processed on the fly to provide immediate results.
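As an illustration of the kind of similarity measure described above, the following minimal sketch computes a Cosine distance between two documents represented as keyword frequency vectors. It is not taken from the thesis: the function names `term_frequencies` and `cosine_distance`, the tokenisation rule, and the sample sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lower-case the text and count alphabetic word tokens (assumed tokeniser)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical snippets for an ambiguous search term.
car_page = term_frequencies("jaguar car engine")
cat_page = term_frequencies("jaguar cat jungle")
distance = cosine_distance(car_page, cat_page)
```

Because Cosine distance depends only on the angle between vectors, a long page and a short page with the same keyword proportions are treated as close, which is one common reason to prefer it over Euclidean distance for text.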
In addition to overcoming limitations of conventional clustering, this study also identifies and tackles practical challenges to clustering Webpage documents, such as how to reduce noise (redundant text, adverts, code) effectively without compromising speed as a consequence of increasing the complexity of pre-processor functions. In summary, this thesis describes the development and implementation of a novel online clustering application that delivers enhanced search engine results in real time. An improvement to currently available algorithms, such as the use of Cosine-based distance measures, was required so that the clustering could be carried out on the output of existing search engines without performance degradation as dimensionality increases. This resulted in the implementation of an efficient non-iterative online data clustering technique, capable of high-dimensionality processing and based on keyword frequency and Potential calculations, to contextually cluster documents. This approach offers real-time clustering of search results without the excessive consumption of computing resources typical of more conventional clustering algorithms, while still being able to adapt to the input of new documents 'on the fly'.
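To make the idea of non-iterative, online clustering concrete, the sketch below assigns each document to a cluster in a single pass as its vector arrives. It deliberately swaps in a simple leader-style threshold rule and does not reproduce the Potential-based method of the thesis, whose details are not given in this record. The class name `OnlineClusterer`, the `threshold` value, and the sample vectors are assumptions.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class OnlineClusterer:
    """One-pass leader-style clustering (an assumed stand-in, not the thesis method)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # maximum distance for joining a cluster
        self.centroids = []         # summed term counts per cluster
        self.members = []           # document ids per cluster

    def add(self, doc_id, vector):
        """Assign one incoming document immediately and return its cluster index."""
        best, best_dist = None, None
        for i, centroid in enumerate(self.centroids):
            d = cosine_distance(vector, centroid)
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        if best is None or best_dist > self.threshold:
            # No existing cluster is close enough: start a new one.
            self.centroids.append(Counter(vector))
            self.members.append([doc_id])
            return len(self.centroids) - 1
        # Cosine distance ignores scale, so summed counts act as a centroid.
        self.centroids[best].update(vector)
        self.members[best].append(doc_id)
        return best


clusterer = OnlineClusterer(threshold=0.5)
first = clusterer.add("a", Counter({"jaguar": 1, "car": 1, "engine": 1}))
second = clusterer.add("b", Counter({"jaguar": 1, "cat": 1, "jungle": 1}))
third = clusterer.add("c", Counter({"car": 1, "engine": 1, "speed": 1}))
```

Each document is seen once and never revisited, so the cost per document grows with the number of clusters, not with the number of documents already processed.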
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available