Title: Performance modelling and optimisation of NoSQL database systems
Author: Dipietro, Salvatore
ISNI: 0000 0004 9350 1567
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2020
Over the last decade, the use of mathematical models and tools to describe and analyse computer applications has grown considerably, for example to automate management in the cloud. Modelling techniques help performance engineers to analyse the behaviour of a system under certain simplifying assumptions and to predict its performance, often without running experiments on the real system. However, modern computer applications such as distributed applications can be challenging to describe using models and, even then, their analysis can be technically non-trivial, partly because of the number of resources involved and their interactions. In this thesis, we consider modelling and optimisation of distributed NoSQL databases, focussing in particular on Apache Cassandra. NoSQL databases have attracted considerable interest in recent years thanks to their high availability, scalability, flexibility and low latency. Nonetheless, concrete implementations such as Cassandra are challenging to analyse, since requests interact in complex ways with the nodes that form the database ring. We address the underpinning modelling and management challenges as follows. We first propose a novel queueing network model for Cassandra to support database resource provisioning exercises. The model explicitly defines key configuration parameters of Cassandra, such as consistency levels and replication factor, allowing engineers to compare alternative system setups. The experiments are conducted using different architectures and hardware resources, and the model achieves good predictive accuracy across different loads and consistency levels. In addition, we present a case study where the model is used to perform capacity planning activities and to compare possible alternative consistency level definition strategies. A second contribution focuses on management, where we introduce PAX, a partition-aware elastic resource management system for Apache Cassandra.
PAX allows engineers to adapt NoSQL database resources to reduce operational costs without compromising Service-Level Objectives (SLOs). Using low-overhead query sampling and knowledge of the data partitioning across the nodes, PAX automatically adapts capacity in Cassandra clusters, searching for the configuration that achieves the best performance. We analyse the system using a reactive and a proactive implementation of PAX and compare their performance under different workloads with varying intensities and item popularity distributions, finding that the proactive version of PAX in particular significantly reduces SLO violations. We also present a new estimation algorithm, called State Divergence (SD), to instantiate performance models based on empirical measurements. Frequently, service demands for real-world systems are estimated in testing environments whose characteristics can differ from those of production, leading to inaccurate performance predictions. SD offers a novel approach to demand estimation that has a minimal impact on the application and is therefore also suitable for use in production environments. Unlike existing inference algorithms, SD seeks to minimise the divergence between the marginal state probabilities of the real system and the analysed model, producing accurate demand estimates that reflect not only the performance metrics but also the likelihood that the system is in a given state. We validate the SD estimation algorithm through several randomly generated models and by means of a real case study conducted on Apache Cassandra. The results show that SD infers the demands of the system under study with low error and accurately predicts its performance, allowing performance models to be parameterised with ease and with higher fidelity than existing methods.
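As a rough illustration of the idea behind divergence-based demand estimation, the following toy sketch fits the service demand of a single queue by matching its marginal state distribution to an observed one. It is not the SD algorithm from the thesis: the M/M/1 queue, the Kullback-Leibler divergence, the grid search and all function names are assumptions chosen only to keep the example small.

```python
import math


def mm1_state_probs(arrival_rate, demand, max_state):
    """Queue-length probabilities of an M/M/1 queue (assumed model),
    truncated at max_state and renormalised to sum to one."""
    rho = arrival_rate * demand  # utilisation law: rho = lambda * D
    if not 0.0 < rho < 1.0:
        raise ValueError("utilisation must lie strictly between 0 and 1")
    probs = [(1.0 - rho) * rho ** n for n in range(max_state + 1)]
    total = sum(probs)
    return [p / total for p in probs]


def kl_divergence(observed, model):
    """Kullback-Leibler divergence of the model from the observed distribution."""
    return sum(p * math.log(p / q) for p, q in zip(observed, model) if p > 0.0)


def estimate_demand(observed, arrival_rate, grid_size=10000):
    """Grid-search the demand whose state distribution diverges least
    from the observed one (hypothetical helper, not the thesis method)."""
    max_state = len(observed) - 1
    best_demand, best_div = None, math.inf
    for i in range(1, grid_size):
        demand = (i / grid_size) / arrival_rate  # candidate utilisation i/grid_size
        model = mm1_state_probs(arrival_rate, demand, max_state)
        div = kl_divergence(observed, model)
        if div < best_div:
            best_demand, best_div = demand, div
    return best_demand


# Synthetic "measurements": states 0..20 of a queue with a true demand of 12 ms.
observed = mm1_state_probs(arrival_rate=50.0, demand=0.012, max_state=20)
estimated = estimate_demand(observed, arrival_rate=50.0)
```

In this sketch the observed distribution is generated from the same model family, so the search should recover a demand close to the one used to generate it. With real measurements the observed state frequencies would come from sampling the running system, and a richer queueing model would presumably take the place of the single queue.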
Supervisor: Casale, Giuliano
Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral