Meta-data to enhance case-based prediction
The focus of this thesis is to measure the regularity of case bases used in Case-Based Prediction (CBP) systems and the reliability of their constituent cases prior to the system's deployment to influence user confidence on the delivered solutions. The reliability information, referred to as meta-data, is then used to enhance prediction accuracy. CBP is a strain of Case-Based Reasoning (CBR) that differs from the latter only in the solution feature which is a continuous value. Several factors make implementing such systems for prediction domains a challenge. Typically, the problem and solution spaces are unbounded in prediction problems that make it difficult to determine the portions of the domain represented by the case base. In addition, such problem domains often exhibit complex and poorly understood interactions between features and contain noise. As a result, the overall regularity in the case base is distorted which poses a hindrance to delivery of good quality solutions. Hence in this research, techniques have been presented that address the issue of irregularity in case bases with an objective to increase prediction accuracy of solutions. Although, several techniques have been proposed in the CBR literature to deal with irregular case bases, they are inapplicable to CBP problems. As an alternative, this research proposes the generation of relevant case-specific meta-data. The meta-data is made use of in Mantel's randomisation test to objectively measure regularity in the case base. Several novel visualisations using the meta-data have been presented to observe the degree of regularity and help identify suspect unreliable cases whose reuse may very likely yield poor solutions. Further, performances of individual cases are recorded to judge their reliability, which is reflected upon before selecting them for reuse along with their distance from the problem case. The intention is to overlook unreliable cases in favour of relatively distant yet more reliable ones for reuse to enhance prediction accuracy. The proposed techniques have been demonstrated on software engineering data sets where the aim is to predict the duration of a software project on the basis of past completed projects recorded in the case base. Software engineering is a human-centric, volatile and dynamic discipline where many unrecorded factors influence productivity. This degrades the regularity in case bases where cases are disproportionably spread out in the problem and solution spaces resulting in erratic prediction quality. Results from administering the proposed techniques were helpful to gain insight into the three software engineering data sets used in this analysis. The Mantel's test was very effective at measuring overall regularity within a case base, while the visualisations were learnt to be variably valuable depending upon the size of the data set. Most importantly, the proposed case discrimination system, that intended to reuse only reliable similar cases, was successful at increasing prediction accuracy for all three data sets. Thus, the contributions of this research are some novel approaches making use of meta-data to firstly provide the means to assess and visualise irregularities in case bases and cases from prediction domains and secondly, provide a method to identify unreliable cases to avoid their reuse in favour to more reliable cases to enhance overall prediction accuracy.