Use this URL to cite or link to this record in EThOS: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.658011
Title: Towards a resilience investigation framework for high performance computing
Author: Naughton, Thomas J.
ISNI: 0000 0004 5351 6588
Awarding Body: University of Reading
Current Institution: University of Reading
Date of Award: 2014
Abstract:
As large-scale scientific computing platforms increase in size and capability, their complexity also grows. These systems require great care and attention, much of which is due to the rise in failures from increased node/component counts. Fault tolerance, or resilience, is a key challenge for computing and a major factor in the successful utilization of high-end scientific computing platforms. As the importance of fault tolerance increases, methods for experimentation into new mechanisms and policies are critical. The methodical investigation of failure in these systems is hampered by their scale and a lack of tools for controlled experimentation. The focus of this research is to provide a versatile, low-overhead platform for fault tolerance/resilience experimentation in a high-performance computing (HPC) environment. The objective is to extend the HPC workflow and toolkit to provide ways of studying large-scale scientific applications at extreme scales with synthetic faults (errors) in a controlled environment. As part of this research we leverage prior work in the areas of HPC system software and performance evaluation tools to enable controlled experimentation through fault injection, while maintaining acceptable performance for scientific workloads. The research identifies two crucial characteristics that are balanced for fault-injection experiments: (i) integration (context), and (ii) isolation (protection). The result of this research is a Resilience Investigation Framework (RIF) that provides HPC users and developers with a versatile experimental framework that balances integration and isolation when exploring resilience methods and policies in large-scale systems.
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.658011
DOI: Not available