From crash tolerance to Byzantine tolerance: fail signalling dependable distributed systems
Many fault-tolerant group communication middleware systems have been implemented assuming crash failure semantics. While this assumption is not unreasonable, it becomes hard to justify when applications must meet high reliability requirements and are built from commercial off-the-shelf (COTS) components. This thesis proposes and implements techniques for dealing with Byzantine faults in a distributed group communication system. Each process is duplicated into two replicas so that the process becomes a self-checking pair: the two replicas of a pair communicate synchronously over a reliable network, whereas replicas belonging to different processes may be connected asynchronously. The approach relies on the replicas obeying state machine replication (SMR), which is used to assure signal-on-failure (fail-signal) semantics: at least one of the two replicas always issues a signal to other entities whenever there is a failure between or within the entities. In this way, dependability activities such as group member failure detection, liveness and security are moved from the upper layer of the group communication service down to the two-replica pairs. With most failure detection and security activities confined to the replica pair, the semantics of group communication are simplified and the number of phases and rounds in group communication protocols is reduced. The thesis demonstrates the fail-signalling concept by converting each member of a group communication system, through duplication, into a self-checking pair. Security mechanisms are added to the replicas' fail-signalling capabilities to tolerate even more serious Byzantine faults. Performance results for the traditional group communication system are compared with those for a group system with duplicated fail-signalling group members.
The thesis shows that fail-signalling group communication detects failures faster and without relying on suspicions, which results in better group communication semantics, better handling of member failures and faster formation of new group views.
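
The self-checking pair described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `Replica`, `FailSignalPair`, `FAIL_SIGNAL`, `add` and `faulty_add` are hypothetical, and the sketch assumes that both replicas run the same deterministic state machine, receive the same inputs in the same order, and that networking, timing and cryptographic signing are left out. It shows only the core idea that the pair emits an output when both replicas agree and a fail-signal otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Hypothetical marker that the pair emits in place of an output on failure.
FAIL_SIGNAL = "FAIL-SIGNAL"


@dataclass
class Replica:
    """One replica of a process, modelled as a deterministic state machine."""
    step: Callable[[int, int], int]  # (state, message) -> new state
    state: int = 0

    def apply(self, msg: int) -> int:
        self.state = self.step(self.state, msg)
        return self.state


class FailSignalPair:
    """Two replicas whose outputs are compared before anything is emitted."""

    def __init__(self, first: Replica, second: Replica) -> None:
        self.first = first
        self.second = second
        self.failed = False

    def deliver(self, msg: int) -> Union[int, str]:
        # Once a divergence has been seen, the pair only ever signals failure.
        if self.failed:
            return FAIL_SIGNAL
        a = self.first.apply(msg)
        b = self.second.apply(msg)
        if a != b:
            self.failed = True
            return FAIL_SIGNAL
        return a


def add(state: int, msg: int) -> int:
    """Correct transition function: accumulate the messages."""
    return state + msg


def faulty_add(state: int, msg: int) -> int:
    """Assumed arbitrary fault: corrupts the state when the message is 3."""
    return state + msg + (1 if msg == 3 else 0)


healthy = FailSignalPair(Replica(add), Replica(add))
outputs_ok = [healthy.deliver(m) for m in (1, 2, 3)]

faulty = FailSignalPair(Replica(add), Replica(faulty_add))
outputs_bad = [faulty.deliver(m) for m in (1, 2, 3, 4)]
```

In this sketch other group members never have to suspect the pair on the basis of timeouts: they either receive an agreed output or an explicit fail-signal, which is the property the abstract credits for simpler group communication semantics.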