Title:

Linear programming algorithms for detecting separated data in binary logistic regression models

This thesis studies the detection of separation among the sample points in binary logistic regression models. We propose a new algorithm for detecting separation and demonstrate empirically that it can be computed fast enough to be used routinely as part of the fitting process for logistic regression models. When a binary logistic regression model is fit by the method of maximum likelihood, the parameter estimates sometimes fail to converge to finite values. This phenomenon (also known as monotone likelihood or infinite parameters) occurs because of a condition among the sample points known as separation. There are two classes of separation: complete and quasicomplete. When complete separation is present among the sample points, iterative procedures for maximizing the likelihood tend to break down, which makes it clear that there is a problem with the model. When quasicomplete separation is present, however, the iterative procedures tend to satisfy their convergence criterion before giving any indication of separation. The new algorithm is based on a linear program with a nonnegative objective function that has a positive optimal value when separation is present among the sample points. We compare several approaches for solving this linear program and find that a method based on determining the feasibility of its dual provides a numerically reliable test for separation among the sample points. A simulation study shows that this test takes about as long to compute as fitting the binary logistic regression model by iteratively reweighted least squares; hence the test is fast enough to be used routinely as part of the fitting procedure. An implementation of our algorithm (as well as the other methods described in this thesis) is available in the R package safeBinaryRegression.
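
To make the linear-programming idea concrete, the following is a minimal sketch in Python, not the thesis's implementation and not the safeBinaryRegression API. It assumes one common way of posing the problem: code each response as a sign, require every sample point to lie on the correct side of (or on) a hyperplane through the origin of the design space, and maximize the total signed margin. The function name `has_separation`, the tolerance `tol`, and the box bounds on the coefficients (added here only to keep the program bounded) are illustrative assumptions. The sketch solves the program directly with SciPy's `linprog`, whereas the abstract reports that a test based on the feasibility of the dual was found to be numerically reliable.

```python
import numpy as np
from scipy.optimize import linprog


def has_separation(X, y, tol=1e-8):
    """Return True if an LP finds a direction that separates the sample points.

    X : (n, p) design matrix, including an intercept column if the model has one.
    y : length-n array of 0/1 responses.
    tol : threshold above which the optimal value counts as positive (assumed).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Flip the rows with y = 0 so that "correct side" reads Z @ beta >= 0 for every row.
    signs = np.where(y == 1, 1.0, -1.0)
    Z = X * signs[:, None]

    # Maximize sum(Z @ beta); linprog minimizes, so negate the objective.
    c = -Z.sum(axis=0)

    # Z @ beta >= 0 is written as -Z @ beta <= 0.
    # The box -1 <= beta_j <= 1 is an assumption that keeps the LP bounded;
    # beta = 0 is always feasible, so the optimal value is never negative.
    result = linprog(
        c,
        A_ub=-Z,
        b_ub=np.zeros(Z.shape[0]),
        bounds=[(-1.0, 1.0)] * Z.shape[1],
        method="highs",
    )

    # A positive optimum means some beta puts every point on the correct side
    # with at least one point strictly inside: complete or quasicomplete separation.
    return result.status == 0 and -result.fun > tol
```

With an intercept column and a single predictor, calling `has_separation` on responses that switch from 0 to 1 exactly once along the predictor should report separation, while responses that alternate along the predictor should not. A check of this kind could be run before the iterative fit, so that quasicomplete separation is flagged even when the fitting procedure would otherwise satisfy its convergence criterion.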
