Optimal utilization of historical data sets for the construction of software cost prediction models
Accurate prediction of software development cost at an early stage of the development life cycle can have a vital economic impact and provides fundamental information for management decision making. However, it is not well understood in practice how to make optimal use of historical software project data when constructing cost prediction models. This is because analysing historical data sets for software cost estimation raises many practical difficulties. In addition, little research has been done to demonstrate the benefits of such analysis. To overcome these limitations, this research proposes a preliminary data analysis framework, which is an extension of Maxwell's study. The proposed framework is based on a set of statistical analysis methods, such as correlation analysis, stepwise ANOVA, and univariate analysis, and provides a formal basis for the construction of cost prediction models from historical data sets. The proposed framework is empirically evaluated with commonly used prediction methods, namely Ordinary Least-Squares Regression (OLS), Robust Regression (RR), Classification and Regression Trees (CART), and K-Nearest Neighbour (KNN), and is applied to both heterogeneous and homogeneous data sets. Formal statistical significance testing was performed for the comparisons. The results of the comparative evaluation suggest that the proposed preliminary data analysis framework is capable of constructing more accurate prediction models for all selected prediction techniques. The predictor variables processed by the framework are statistically significant at the 95% confidence level for both parametric techniques (OLS and RR) and one non-parametric technique (CART). Both the heterogeneous and the homogeneous data set benefit from the application of the proposed framework in terms of project effort prediction accuracy. Of the two, the homogeneous data set shows the greater improvement after being processed by the framework.
Overall, the evaluation results demonstrate that the proposed framework has excellent applicability. Further research could pursue two main directions: first, improving applicability by integrating missing-data techniques such as listwise deletion (LD) and mean imputation (MI) for handling missing values in historical data sets; second, applying benchmarking to enable comparisons, i.e., allowing companies to compare themselves with respect to their productivity or quality.
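
As a rough illustration of the two missing-data techniques named above, the following minimal sketch contrasts listwise deletion with mean imputation on a tiny project table. The records, the field names (size_fp, team_size, effort_hours), and the helper functions are hypothetical and chosen only for illustration; they are not taken from the data sets or the framework evaluated in this work.

```python
from statistics import mean

# Hypothetical project records (illustrative values only); None marks a missing value.
projects = [
    {"size_fp": 120.0, "team_size": 5.0, "effort_hours": 900.0},
    {"size_fp": 300.0, "team_size": None, "effort_hours": 2400.0},
    {"size_fp": None, "team_size": 8.0, "effort_hours": 1500.0},
    {"size_fp": 210.0, "team_size": 7.0, "effort_hours": 1600.0},
]


def listwise_deletion(rows):
    """Keep only the rows that have no missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    columns = list(rows[0].keys())
    col_means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else col_means[c]) for c in columns}
        for r in rows
    ]


complete_rows = listwise_deletion(projects)  # drops every incomplete project
imputed_rows = mean_imputation(projects)     # keeps all projects, fills the gaps
```

Listwise deletion preserves only fully observed projects and therefore shrinks the data set, whereas mean imputation keeps every project at the cost of reducing the variance of the imputed variables; which trade-off is acceptable would depend on how much data is missing.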