Title: Prediction of panel and streaming data using wavelet transform-based decision trees
Author: Zhao, Xin
Awarding Body: University of Leeds
Current Institution: University of Leeds
Date of Award: 2018
Decision trees are a popular model for classification and regression because they are easy to interpret and require no parametric assumptions. In the tree building process, we choose the Gini index as the splitting criterion, which performs well for data with many missing values and many categories (values). Other splitting criteria in use include average squared error and statistical significance testing. In the tree pruning process, we use cross-validation to choose the best tree, that is, the one with the minimum possible prediction error. When the explanatory variables are time series, however, trees cannot detect the potential correlation in them and may be influenced by the noise involved. We therefore use wavelet analysis to transform the original time series into wavelet transformed variables, decomposing each series into scaling and wavelet coefficients that represent the smooth and detail information at different resolution levels. The basis we choose is the Haar wavelet, as it is simple to interpret. Other bases are also considered, but they do not perform obviously better than the Haar wavelet. To keep the computational time under control, the wavelet transform approach is best suited to data without too many variables; even so, the increase in computational time caused by using high-dimensional wavelet transformed variables is roughly linear in the increase in the number of variables. The computational time will therefore not increase rapidly when the data are transformed to suitable resolution levels or when the number of original variables is small. The first application of decision trees with wavelet transformed variables is panel data classification. Trees can classify each observation, but cannot classify an individual, which comprises many observations. So we design three methods for panel data classification.
After classifying each observation using trees, Method 1 classifies each individual by taking the majority class of its observations. Method 3 transforms the panel data into cross-sectional data by summarizing the information for each individual and then uses trees to classify these cross-sectional data. Method 2 is based on Method 1 and is similar to, but more complicated than, Method 3. The difference between Method 2 and Method 3 is that the transformed cross-sectional data are no longer heart rate values or wavelet transformed heart rate values but the probabilities of each observation being classified as group 1. These probabilities are calculated from Method 1, which is why we number this method second. Results show that Method 3 is generally the best on both simulated and real data, as it works directly on individuals, while Methods 1 and 2 are based on classification results for observations, which are not our primary target. The second and third applications are time series prediction. In the second, we explore, for static regression, whether wavelet transformed variables are better than the original variables in regression problems under different circumstances. These circumstances include different seasonal effects at a possible time lag of the explanatory variables. The models are then applied to real liver transplantation (LT) surgery data and China air pollution data, both of which show that the wavelet transformed variables are better. Wavelet transformed variables are used directly in the third application: interval forecasting for streaming data. In the forecasting process, if both the predicted value and its prediction interval are known, we know more about the uncertainty in the prediction. There are two choices for interval construction. Gaussian prediction intervals work well if the time series clearly follows a Gaussian distribution.
The quantile interval is not restricted by Gaussian distributional assumptions, which makes it suitable in this context, as we do not know the distribution of the future data. Performance is measured by coverage and interval width. Instead of using only one model, ensemble models are also considered. By comparing trees with typical models such as ARIMA and GARCH in both simulation and real data applications, we find that trees are more computationally efficient than both alternatives. Compared with trees, ARIMA may produce a much wider prediction interval when a trend is falsely detected, and it is slow to react when the distribution changes. GARCH performs similarly to trees in coverage and interval width. Tree methods are therefore suggested for time series prediction. When comparing wavelet transformed variables and original variables in both classification and regression, on simulated and real data, results show that wavelet transformed variables perform as well as or better than the original variables in accuracy. Models using wavelet transformed variables also provide more detailed information, which gives a better understanding of the classification or regression process.
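As an illustration of the decomposition described in the abstract, the following is a minimal sketch of a multi-level Haar transform that turns a time series into scaling (smooth) and wavelet (detail) coefficients, which could then serve as input variables for a tree. It is not taken from the thesis: the function names `haar_step`, `haar_decompose` and `haar_features`, the orthonormal 1/sqrt(2) scaling, the sign convention for the detail coefficients, and the requirement that the series length be divisible by 2 to the power of the number of levels are all assumptions made for this example.

```python
import numpy as np


def haar_step(x):
    """One level of the Haar transform (assumed orthonormal convention)."""
    x = np.asarray(x, dtype=float)
    if x.size % 2 != 0:
        raise ValueError("series length must be even")
    first, second = x[0::2], x[1::2]
    scaling = (first + second) / np.sqrt(2.0)  # smooth information
    wavelet = (first - second) / np.sqrt(2.0)  # detail information
    return scaling, wavelet


def haar_decompose(x, levels):
    """Apply haar_step repeatedly to the scaling coefficients.

    Returns the coarsest scaling coefficients and a list of wavelet
    coefficient arrays, ordered from the finest level to the coarsest.
    """
    x = np.asarray(x, dtype=float)
    if levels < 1 or x.size % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    scaling, details = x, []
    for _ in range(levels):
        scaling, wavelet = haar_step(scaling)
        details.append(wavelet)
    return scaling, details


def haar_features(x, levels):
    """Flatten the decomposition into one feature vector (coarse to fine)."""
    scaling, details = haar_decompose(x, levels)
    return np.concatenate([scaling] + details[::-1])


# Hypothetical series of length 8, decomposed to two resolution levels.
series = [4, 6, 10, 12, 8, 6, 5, 5]
coarse, details = haar_decompose(series, levels=2)
features = haar_features(series, levels=2)
```

Because this convention is orthonormal, the feature vector has the same length and the same sum of squares as the original series, so no information is lost; the transform only reorganizes it by resolution level.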
Supervisor: Barber, Stuart; Taylor, Charles
Sponsor: China Scholarship Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available