Title:
|
The selection of covariates for the relationship between blood-lead and ability
|
This thesis arose from a problem in the analysis of data from the Edinburgh Lead Study. The data were to be used to estimate the influence of children's blood lead levels on their mental abilities, controlling for other factors which might confound this relationship. The other factors were summarised as a set of covariate scores, and the question arose as to which of these scores should be included in a multiple regression whose purpose was to estimate the coefficient of blood-lead. This problem has arisen in other studies of the influence of lead on ability, and a variety of solutions have been implemented. The statistical and epidemiological literature offers little guidance. The problem is formalised by proposing regression models with various assumptions. Expressions are derived for the mean-square-error of the parameter of special interest (here the blood-lead coefficient) in terms of quantities which can be calculated from the data. Various stepwise procedures are proposed for selecting a sub-set of covariates to include in the regression equation. These include the usual stepwise procedures, as well as new ones based on the various mean-square-error criteria and on changes in the coefficient of interest. These procedures are studied for the data from the Edinburgh Lead Study and evaluated by simulation in different ways. The potential for variance reduction from sub-models, compared to including all covariates, is a function of the multiple correlation between the variable of special interest and the variables which could be omitted from the model. The results suggest that, unless this correlation exceeds 0.2, inferences should be based on a regression with the full set of covariates. The greatest benefit is obtained from sub-set selection procedures when the multiple correlation is increased as a result of a decrease in the residual degrees of freedom. 
In these circumstances the multiple correlation will be high, but its value will fall when the usual adjustment for degrees of freedom is applied. The simulation results suggest that sub-set selection will be beneficial when the residual degrees of freedom for the full model are less than three times the number of covariates. The method which performed best was to select, at each step, the variable which made the largest change in the coefficient of interest. Stopping rules for this criterion are proposed. This method was less prone than the other methods to underestimate the variance of the coefficient of interest, when this is evaluated in the usual way for the final model. However, it performed badly, and underestimated this variance, for artificial data in which the population multiple correlation between the variable of special interest and the covariates was high. This suggests that sub-set selection should not be used when the estimated multiple correlation adjusted for degrees of freedom is high. These criteria applied to the Lead Study data would suggest that the effect of lead on ability should be assessed by adjusting for all the covariate scores.
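The best-performing criterion described above, selecting at each step the variable which makes the largest change in the coefficient of interest, might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say whether the procedure runs forwards or backwards, so a forward version is assumed here; the function names and the `min_change` threshold are hypothetical, and the threshold is a placeholder rather than one of the stopping rules proposed in the thesis.

```python
import numpy as np


def coefficient_of_interest(y, x, Z, cols):
    """OLS coefficient of x when y is regressed on an intercept, x and Z[:, cols]."""
    design = np.column_stack([np.ones(len(y)), x, Z[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]  # column 0 is the intercept, column 1 is x


def select_by_coefficient_change(y, x, Z, min_change=0.0):
    """Forward selection (an assumed direction): at each step add the covariate
    whose inclusion changes the coefficient of x the most.

    Stops when the largest change does not exceed min_change, which is a
    hypothetical placeholder and not a stopping rule taken from the thesis.
    Returns the selected column indices and the final coefficient of x.
    """
    selected = []
    remaining = list(range(Z.shape[1]))
    current = coefficient_of_interest(y, x, Z, selected)
    while remaining:
        # Absolute change in the coefficient of x for each candidate covariate.
        changes = {
            j: abs(coefficient_of_interest(y, x, Z, selected + [j]) - current)
            for j in remaining
        }
        best = max(changes, key=changes.get)
        if changes[best] <= min_change:
            break
        selected.append(best)
        remaining.remove(best)
        current = coefficient_of_interest(y, x, Z, selected)
    return selected, current
```

Here `y` would stand for the ability score, `x` for blood-lead and the columns of `Z` for the covariate scores. As the abstract cautions, the usual variance estimate from the final model may understate the uncertainty in the selected coefficient.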
|