Use this URL to cite or link to this record in EThOS:
Title: A resource aware distributed LSI algorithm for scalable information retrieval
Author: Liu, Yang
ISNI:       0000 0004 2703 3291
Awarding Body: Brunel University
Current Institution: Brunel University
Date of Award: 2011
Availability of Full Text:
Access from EThOS:
Access from Institution:
Latent Semantic Indexing (LSI) is one of the popular techniques in the information retrieval fields. Different from the traditional information retrieval techniques, LSI is not based on the keyword matching simply. It uses statistics and algebraic computations. Based on Singular Value Decomposition (SVD), the higher dimensional matrix is converted to a lower dimensional approximate matrix, of which the noises could be filtered. And also the issues of synonymy and polysemy in the traditional techniques can be overcome based on the investigations of the terms related with the documents. However, it is notable that LSI suffers a scalability issue due to the computing complexity of SVD. This thesis presents a resource aware distributed LSI algorithm MR-LSI which can solve the scalability issue using Hadoop framework based on the distributed computing model MapReduce. It also solves the overhead issue caused by the involved clustering algorithm. The evaluations indicate that MR-LSI can gain significant enhancement compared to the other strategies on processing large scale of documents. One remarkable advantage of Hadoop is that it supports heterogeneous computing environments so that the issue of unbalanced load among nodes is highlighted. Therefore, a load balancing algorithm based on genetic algorithm for balancing load in static environment is proposed. The results show that it can improve the performance of a cluster according to heterogeneity levels. Considering dynamic Hadoop environments, a dynamic load balancing strategy with varying window size has been proposed. The algorithm works depending on data selecting decision and modeling Hadoop parameters and working mechanisms. Employing improved genetic algorithm for achieving optimized scheduler, the algorithm enhances the performance of a cluster with certain heterogeneity levels.
Supervisor: Li, M. Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Mapreduce ; Simulation ; Load balancing