Accuracy of logistic models and receiver operating characteristic curves
The accuracy of prediction is a commonly studied topic in modern statistics. The performance of a predictor is becoming increasingly more important as real-life decisions axe made on the basis of prediction. In this thesis we investigate the prediction accuracy of logistic models from two different approaches. Logistic regression is often used to discriminate between two groups or populations based on a number of covariates. The receiver operating characteristic (ROC) curve is a commonly used tool (especially in medical statistics) to assess the performance Of such a score or test. By using the same data to fit the logistic regression and calculate the ROC curve we overestimate the performance that the score would give if validated on a sample of future cases. This overestimation is studied and we propose a correction for the ROC curve and the area under the curve. The methods axe illustrated through way of two medical examples and a simulation study, and we show that the overestimation can be quite substantial for small sample sizes. The idea of shrinkage pertains to the notion that by including some prior information about the data under study we can improve prediction. Until now, the study of shrinkage has almost exclusively been concentrated on continuous measurements. We propose a methodology to study shrinkage for logistic regression modelling of categorical data with a binary response. Categorical data with a large number of levels is often grouped for modelling purposes, which discards useful information about the data. By using this information we can apply Bayesian methods to update model parameters and show through examples and simulations that in some circumstances the updated estimates are better predictors than the model.