Title: A framework for efficient management of fault tolerance in cloud data centres and high-performance computing systems : an investigation and performance analysis of a cloud based virtual machine success and failure rate in a typical cloud computing environment and prediction methods
Author: Mohammed, Bashir
ISNI: 0000 0004 8497 5699
Awarding Body: University of Bradford
Current Institution: University of Bradford
Date of Award: 2019
Cloud computing is attracting increasing attention in both academic research and industry initiatives and has been widely used to solve advanced computational problems. As cloud datacentres continue to grow in scale and complexity, the risk of failure of virtual machines (VMs) and hosts that run many jobs and process large numbers of user requests increases, and potential failures within a datacentre consequently become even more difficult to predict. Although fault tolerance continues to be an issue of growing concern in cloud and HPC systems, mitigating the impact of failure and providing accurate predictions with enough lead time remains a difficult research problem. Traditional fault-tolerance strategies such as regular checkpoint/restart and replication are not adequate given the emerging complexity of these systems, and they do not scale well in the cloud because of resource sharing and distributed networks. In this thesis, a new reliable fault tolerance scheme using an intelligent optimal strategy is presented to ensure high system availability, reduced task completion time and an efficient VM allocation process. Specifically, (i) a generic fault tolerance algorithm for cloud data centres and HPC systems in the cloud was developed. (ii) A verification process was developed for a fully dimensional VM specification during allocation in the presence of faults. In comparison with existing approaches, the results obtained show an increase in the success rate of the VMs, a reduction in the response time of VM allocation and improved overall performance. (iii) A failure prediction model was further developed, and the predictive capabilities of machine learning were explored by applying several algorithms to improve the accuracy of prediction.
Experimental results indicate that the proposed model predicts failures with an average accuracy of about 90% in comparison with existing algorithms, which implies that the approach can effectively predict potential system and application failures.
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Cloud computing ; Fault tolerance ; Check-pointing ; Virtualisation ; Load balancing ; Virtual machine ; Failure ; Machine learning ; High-performance computing