Title:
|
New software-based fault tolerance methods for high performance computing
|
As computer systems become ever more powerful and parallel, processing larger and
larger sets of data, there is an increasing need to ensure that scientific software applications
are tolerant to faults in both hardware and software. New algorithms which take advantage
of knowledge about the structure and calculation of important mathematical problems
would enable more efficient and fault-tolerant computation to be performed
with minimal overhead.
This thesis demonstrates how improvements to two important application areas in High
Performance Computing (HPC) - Monte Carlo methods and Sparse Linear Algebra
- can result in software with greater fault tolerance alongside low overheads. It
proposes models that employ variations on existing techniques dealing with layout topologies
in grids and a form of Error-Correcting Code (ECC) to provide an increased degree
of fault tolerance in calculations. The models make efficient use of these variations to produce
schemes that are both robust and based on straightforward approaches that are
simple to implement.
The theory behind the models is developed and evaluated, and basic implementations
are created to gauge the performance and viability of the schemes. Both models perform
well in the majority of cases with low overheads in the range of 0-10%, and both are
eminently scalable. Furthermore, the methods with the highest overhead in the Sparse Linear
Algebra schemes are found to perform better on larger, sparser data sets,
which are the ones that would most need the extra protection afforded by software
fault tolerance.
|