Title: Statistical modelling of markers of severity in rheumatoid arthritis
Author: Taylor, Lyndsey
ISNI:       0000 0004 5348 8716
Awarding Body: University of Sheffield
Current Institution: University of Sheffield
Date of Award: 2014
Introduction: Rheumatoid arthritis (RA) is a complex, chronic autoimmune disorder whose severity varies considerably between patients. Stratifying patients using effective prognostic biomarkers may facilitate therapeutic targeting of biological agents to those at risk of severe joint damage. Genetic variants, including single nucleotide polymorphisms (SNPs), and environmental factors are known to be associated with RA severity. However, to date, no one has attempted to build a full predictive model for RA severity from these associated factors, owing to the high-dimensional, highly correlated nature of the variables.
Methods: Available data from a case-control study investigating genotype-phenotype associations in RA were used to investigate the predictors of RA severity (cases only). Using a sparse form of partial least squares (PLS) methodology (SPLS), SNPs and environmental factors were investigated to form a prediction model of the Larsen score, a validated quantitative measure of erosive joint damage, before the methods were extended to multiple RA severity measures. PLS is a dimension reduction technique which reduces the original variables to a linear combination, with the influence of each variable represented by a ‘loading’. As loadings, rather than beta coefficients from a regression model, are used to assess variable importance, PLS is not restricted by standard regression assumptions. Two datasets were investigated: a genome-wide association study (GWAS) recorded on 394 subjects, referred to as the ‘GWAS SNPs’ dataset, and a maximum of 1009 subjects with 368 SNPs, referred to as the ‘all subjects’ dataset.
A new three-stage procedure was developed to prevent overfitting of the PLS models. The first stage determined the order of predictive importance of the variables using 10 runs of 5-, 7- or 10-fold cross validation (CV), depending on the sample size. Absolute PLS loadings for each variable were ranked, and the median rank across the folds and runs was used to order the variables. The ‘GWAS SNPs’ dataset was analysed in 40 separate blocks of data. Variables ranked <200 were carried forward to a higher level model. The second stage investigated the number of variables to retain in the final model using an independent training and test set. The third stage tested the chosen model on a further independent set.
Results: ‘GWAS SNPs’ dataset: Overfitted models containing 100 variables predicted well during CV (r=0.890) but performed poorly when tested on an independent set (r=0.385). Adding a second stage to the modelling prevented the overfitting; however, only three variables (disease duration, symptom duration and age at time of diagnosis) were selected for the final model, which achieved the highest correlation (r=0.622). ‘All subjects’ dataset: Applying the three-stage process resulted in a 10-variable model (disease duration, symptom duration, age at onset of symptoms, age at time of diagnosis, anti-citrullinated protein antibody (ACPA) category, ACPA value, body mass index (BMI), rs26510, DRB1 S2 and rs26232). The model predicted the Larsen score for 182 independent subjects with a correlation of r=0.456. Restricting the analysis to ACPA-positive patients increased the predictive correlation on an independent set (r=0.629), using a model with six variables (disease duration, symptom duration, age at onset of symptoms, age at time of diagnosis, BMI and rs2073839). Multiple Y variable modelling did not improve prediction of the Larsen score, and the other disease severity variables were poorly predicted.
Conclusion: SPLS is able to select key predictors of RA severity from a large dataset. A three-stage approach is recommended to avoid overfitting of the model. Further research is required to investigate the success of the methodology on a more homogeneous cohort.
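Illustrative sketch (not taken from the thesis): the first-stage ranking described in the abstract, in which absolute PLS loadings are ranked within each cross-validation fold and the median rank is taken across folds and runs, might look roughly like the following. Several things here are assumptions made for illustration only: a plain single-component PLS is substituted for the sparse PLS used in the thesis, the data are synthetic, and the function names (`first_pls_loadings`, `rank_descending`, `median_ranks`, `make_demo_data`) are hypothetical.

```python
import random
from statistics import median


def first_pls_loadings(X, y):
    """Absolute X-loadings of the first PLS component (plain PLS1, not sparse)."""
    n, p = len(X), len(X[0])
    # Centre each column of X and the response y.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector w is proportional to X'y, scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5 or 1.0
    w = [v / norm for v in w]
    # Scores t = Xw, then loadings = X't / t't.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t) or 1.0
    return [abs(sum(Xc[i][j] * t[i] for i in range(n)) / tt) for j in range(p)]


def rank_descending(values):
    """Rank 1 goes to the largest value."""
    order = sorted(range(len(values)), key=lambda j: -values[j])
    ranks = [0] * len(values)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def median_ranks(X, y, n_runs=10, n_folds=5, seed=0):
    """Median rank of each variable's absolute loading over repeated k-fold CV."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    all_ranks = [[] for _ in range(p)]
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for every run
        folds = [idx[k::n_folds] for k in range(n_folds)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in idx if i not in held]
            loadings = first_pls_loadings([X[i] for i in train],
                                          [y[i] for i in train])
            for j, r in enumerate(rank_descending(loadings)):
                all_ranks[j].append(r)
    return [median(r) for r in all_ranks]


def make_demo_data(n=200, p=6, seed=1):
    """Synthetic data (assumption): y depends strongly on variable 0 only."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * row[0] + rng.gauss(0, 0.1) for row in X]
    return X, y


X_demo, y_demo = make_demo_data()
print(median_ranks(X_demo, y_demo))
```

In this synthetic setup the first variable, which drives the response, should receive the best median rank. Variables would then be ordered by median rank before the later stages choose how many to keep and test the chosen model on independent data.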
Supervisor: Teare, M. D.; Wilson, A. G.
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available