Fault-tolerant parallel applications using a network of workstations
It is becoming common to employ a Network Of Workstations, often referred to as a NOW, for general purpose computing since the allocation of an individual workstation offers good interactive response. However, there may still be a need to perform very large scale computations which exceed the resources of a single workstation. It may be that the amount of processing implies an inconveniently long duration or that the data manipulated exceeds available storage. One possibility is to employ a more powerful single machine for such computations. However, there is growing interest in seeking a cheaper alternative by harnessing the significant idle time often observed in a NOW and also possibly employing a number of workstations in parallel on a single problem. Parallelisation permits use of the combined memories of all participating workstations, but also introduces a need for communication. and success in any hardware environment depends on the amount of communication relative to the amount of computation required. In the context of a NOW, much success is reported with applications which have low communication requirements relative to computation requirements. Here it is claimed that there is reason for investigation into the use of a NOW for parallel execution of computations which are demanding in storage, potentially even exceeding the sum of memory in all available workstations. Another consideration is that where a computation is of sufficient scale, some provision for tolerating partial failures may be desirable. However, generic support for storage management and fault-tolerance in computations of this scale for a NOW is not currently available and the suitability of a NOW for solving such computations has not been investigated to any large extent. The work described here is concerned with these issues. The approach employed is to make use of an existing distributed system which supports nested atomic actions (atomic transactions) to structure fault-tolerant computations with persistent objects. This system is used to develop a fault-tolerant "bag of tasks" computation model, where the bag and shared objects are located on secondary storage. In order to understand the factors that affect the performance of large parallel computations on a NOW, a number of specific applications are developed. The performance of these applications is ana- lysed using a semi-empirical model. The same measurements underlying these performance predictions may be employed in estimation of the performance of alternative application structures. Using services provided by the distributed system referred to above, each application is implemented. The implement- ation allows verification of predicted performance and also permits identification of issues regarding construction of components required to support the chosen application structuring technique. The work demonstrates that a NOW certainly offers some potential for gain through parallelisation and that for large grain computations, the cost of implementing fault tolerance is low.