Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.559260
Title: Swarm-array computing : a swarm robotics inspired approach to achieve automated fault tolerance in high-performance computing systems
Author: Varghese, Blesson
Awarding Body: University of Reading
Current Institution: University of Reading
Date of Award: 2011
Abstract: Fault tolerance is an important area of research in high-performance computing. Traditional fault-tolerant methods, which require intervention by a human administrator, have many drawbacks and hence constrain efficient fault tolerance in high-performance computer systems. The research presented in this dissertation is motivated by the need to develop automated fault-tolerant methods for high-performance computing. To this end, four questions are addressed: (1) How can autonomic computing concepts be applied to parallel computing? (2) How can a bridge between multi-agent systems and parallel computing systems be built for achieving fault tolerance? (3) How can processor virtualization for process migration be extended for achieving fault tolerance in parallel computing systems? (4) How can traditional fault-tolerant methods be replaced to achieve efficient fault tolerance in high-performance computing systems? In this dissertation, Swarm-Array Computing, a novel framework inspired by the concept of multi-agents in swarm robotics and built on the foundations of parallel and autonomic computing, is proposed to address these questions. The framework comprises three approaches: firstly, intelligent agents; secondly, intelligent cores; and thirdly, a combination of these, as a means of achieving automated fault tolerance in line with the goals of autonomic computing. The feasibility of the framework is evaluated using simulation and practical experimental studies. The simulation studies were performed by emulating a field programmable gate array on a multi-agent simulator. The practical studies involved the implementation of a parallel reduction algorithm using message passing interfaces on a computer cluster. The statistics gathered from the experiments confirm that the swarm-array computing approaches improve the fault tolerance of high-performance computing systems over traditional fault-tolerant mechanisms.
The agent concepts within the framework are formalised by mapping a layered architecture onto both intelligent agents and intelligent cores. Elements of the work reported in this dissertation have been published as journal and conference papers (Appendix A) and presented as public lectures, conference presentations and posters (Appendix B).
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.559260
DOI: Not available