Fault injection testing of software implemented fault tolerance mechanisms of distributed systems
One way of gaining confidence in the adequacy of fault tolerance mechanisms of a system is to test the system by injecting faults and see how the system performs under faulty conditions. This thesis investigates the issues of testing software-implemented fault tolerance mechanisms of distributed systems through fault injection. A fault injection method has been developed. The method requires that the target software system be structured as a collection of objects interacting via messages. This enables easy insertion of fault injection objects into the target system to emulate incorrect behaviour of faulty processors by manipulating messages. This approach allows one to inject specific classes of faults while not requiring any significant changes to the target system. The method differs from the previous work in that it exploits an object oriented approach of software implementation to support the injection of specific classes of faults at the system level. The proposed fault injection method has been applied to test software-implemented reliable node systems: a TMR (triple modular redundant) node and a fail-silent node. The nodes have integrated fault tolerance mechanisms and are expected to exhibit certain behaviour in the presence of a failure. The thesis describes how various such mechanisms (for example, clock synchronisation protocol, and atomic broadcast protocol) were tested. The testing revealed flaws in implementation that had not been discovered before, thereby demonstrating the usefulness of the method. Application of the approach to other distributed systems is also described in the thesis.