Title: Novel recommendation algorithms with applications to healthcare data analysis
Author: Yue, Wenbin
ISNI: 0000 0005 0294 3640
Awarding Body: Brunel University London
Current Institution: Brunel University
Date of Award: 2020
As public attention to health issues grows, machine learning has driven unprecedented progress in the study of many prevalent diseases. Research on many rare diseases, however, remains limited. Friedreich's ataxia (FRDA) is a rare inherited neurodegenerative disorder that causes progressive damage to the nervous system and deterioration of physical movement. The European Friedreich's Ataxia Consortium for Translational Studies (EFACTS), which is funded by a European Union project, has integrated disease-related resources and assembled a large pool of experts to promote FRDA research. The analysis and application of FRDA baseline data play a crucial role in advancing research on the disease, but several obstacles prevent EFACTS from collecting patient baseline data:
• Lack of rare disease awareness (individual): the disease can be overlooked by patients' families because early symptoms are less severe and relevant medical knowledge is lacking.
• Lack of rare disease awareness (local hospital): doctors in local hospitals may be unable to make a correct and timely diagnosis because the clinical manifestations of these diseases are complex and local hospitals are likely to lack specialists and knowledge in the relevant fields.
• Medical system problems: there is no detailed and effective diagnostic process for rare diseases.
• Economic and physical reasons: medical resources for rare diseases are concentrated in large cities, and many patients in other regions lack the financial means or physical ability to travel there for diagnosis and treatment.
From a computer science perspective, there are three challenges in supporting FRDA baseline data collection: 1) how to develop appropriate strategies to overcome the existing difficulties in collecting disease data; 2) how to apply appropriate machine learning methods to collect baseline data effectively, given the actual situation of FRDA; and 3) how to develop novel algorithms that ensure the accuracy of data collection across the scenarios that may occur. In this thesis, machine learning techniques are used to address the difficulties of baseline data collection and missing value prediction in FRDA. Based on the idea of the recommendation system (RS) in machine learning, a new collection strategy and several improved algorithms are proposed to address the difficulties that may arise in data collection. The main work is as follows:
• A novel data collection strategy is proposed for FRDA baseline data using collaborative filtering (CF) approaches. The strategy is motivated by the current popularity of RS, whose central idea here is that similar patients have similar symptoms on each test-item. Instead of having no data at all, FRDA researchers are provided with predicted baseline data for patients who cannot attend the assessments for physical or psychological reasons, which supports data analysis from the researchers' perspective. It is shown that the CF approaches are capable of predicting baseline data based on the similarity of patients across test-items, where the prediction accuracy is evaluated on three rating scales selected from the EFACTS database.
• To improve the prediction accuracy of baseline data collection, a novel hybrid model is constructed that combines the merits of model-based and memory-based CF methods. The proposed hybrid algorithm has two main features: 1) when a patient has no neighbors sharing similar baseline data, the model-based CF component is activated and uses a clustering method to find similar neighbors based on their attributes; and 2) when a patient does have neighbors, the memory-based component adjusts the initial similarities between patients using a novel similarity measure, which accounts for more statistical characteristics by integrating rating habits and the degree of co-rated items. To evaluate the advantages of the proposed algorithm, the SARA rating scale is selected from the EFACTS database.
• To handle the cold-start condition during FRDA baseline data collection, a weighted-naive-Bayes based CF (WNBCF) algorithm is proposed that takes patient side-information into account. Specifically, the patient side-information is treated as weighted attributes in the WNBCF algorithm to help predict the severity of impairment in different bodily functions of FRDA patients. An attribute-weighting algorithm based on mutual information is first presented to support weight selection. The particle swarm optimization algorithm is then used to optimize the weights obtained by the attribute-weighting algorithm. To assess the proposed WNBCF algorithm, real-world FRDA datasets are chosen from the database provided by EFACTS.
• A modified collaborative filtering (MCF) algorithm with improved performance is developed for recommendation systems, with application to predicting baseline data of FRDA patients. The proposed MCF algorithm combines the individual merits of the user-based CF method and the item-based CF method, and takes both positively and negatively correlated neighbors into account. Weighting parameters are introduced to quantify how much the user-based and item-based CF methods each contribute to the rating prediction, and the particle swarm optimization algorithm is applied to optimize these parameters in order to achieve an adequate tradeoff between the positively and negatively correlated neighbors when predicting rating values. To demonstrate its prediction performance, the MCF algorithm is employed to assist with baseline data collection for FRDA patients.
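The memory-based CF idea described above (predicting a patient's missing test-item score from similar patients, then blending user-based and item-based estimates) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the Pearson similarity, the mean-centred prediction formula, the fixed blending weight alpha (the thesis tunes its weighting parameters with particle swarm optimization), the function names, and the toy scores are all assumptions made for illustration only.

```python
from math import sqrt
from typing import List, Optional

# Rows are patients, columns are test-items; None marks a missing score.
Matrix = List[List[Optional[float]]]


def mean_known(values: List[Optional[float]]) -> float:
    """Mean of the non-missing entries, or 0.0 if there are none."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else 0.0


def pearson(a: List[Optional[float]], b: List[Optional[float]]) -> float:
    """Pearson correlation over co-rated positions; 0.0 when undefined."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    num = sum((x - mx) * (y - my) for x, y in pairs)
    den = sqrt(sum((x - mx) ** 2 for x, _ in pairs)
               * sum((y - my) ** 2 for _, y in pairs))
    return num / den if den > 0 else 0.0


def predict_user_based(scores: Matrix, patient: int, item: int) -> float:
    """Mean-centred, similarity-weighted estimate from the other rows.

    Negatively correlated neighbors also contribute, through the sign
    of their similarity.
    """
    base = mean_known(scores[patient])
    num = den = 0.0
    for other, row in enumerate(scores):
        if other == patient or row[item] is None:
            continue
        sim = pearson(scores[patient], row)
        num += sim * (row[item] - mean_known(row))
        den += abs(sim)
    return base + num / den if den > 0 else base


def predict_item_based(scores: Matrix, patient: int, item: int) -> float:
    """Same formula applied to the transposed matrix (items as rows)."""
    transposed = [list(col) for col in zip(*scores)]
    return predict_user_based(transposed, item, patient)


def predict_blended(scores: Matrix, patient: int, item: int,
                    alpha: float = 0.5) -> float:
    """Blend both estimates; alpha is a hand-picked, hypothetical weight."""
    return (alpha * predict_user_based(scores, patient, item)
            + (1.0 - alpha) * predict_item_based(scores, patient, item))


# Invented toy data: patient 0 has no score for test-item 3.
toy_scores: Matrix = [
    [1.0, 2.0, 3.0, None],
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 2.0, 1.0, 0.0],
]

if __name__ == "__main__":
    print(predict_user_based(toy_scores, 0, 3))
    print(predict_item_based(toy_scores, 0, 3))
    print(predict_blended(toy_scores, 0, 3, alpha=0.5))
```

In this sketch alpha = 1 reduces to pure user-based CF and alpha = 0 to pure item-based CF. A tuned version would choose alpha by minimizing prediction error on held-out scores, which is the role the abstract assigns to particle swarm optimization.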
Supervisor: Wang, Z. ; Liu, X.
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Recommendation System ; Healthcare ; Data Analysis ; Data Prediction ; Optimization