Title:

Variable selection and interpretation in principal component analysis

In many research fields, such as medicine, psychology, management and zoology, large numbers of variables are sometimes measured on each individual. The researcher therefore ends up with a very large data set containing a large number of variables, say p. Using the full data set in a statistical analysis can cause several difficulties, so many applications call for a prior selection of the best subset of q variables, with q ≪ p, to represent the entire data set. In principle, the best subset of size q for a specified objective can always be found by systematically examining all possible subsets of size q, but such a search may be computationally infeasible, especially for large p. Moreover, in many applications, when a Principal Component Analysis (PCA) is carried out on a large number of variables, the resulting Principal Components (PCs) may not be easy to interpret. To aid interpretation, it is useful to reduce the number of variables as much as possible whilst capturing most of the variation of the complete data set, X. This thesis therefore aims to reduce the number of variables studied in a given data set by selecting the best q of the p measured variables, both to highlight the main features of a structured data set and to aid the simultaneous interpretation of the first k (covariance or correlation) PCs. This aim is pursued by generating several artificial data sets with different types of structure, such as nearly independent variables, highly dependent variables and clustered variables. For each structure, several Variable Selection Criteria (VSC) are applied in order to retain subsets of size q. The efficiencies of the retained subsets are then measured in order to determine the best criteria for retaining subsets of size q. Finally, the general results obtained from the artificial data analyses are evaluated on some real data sets having interesting covariance and correlation structures.
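The workflow described above (generate structured artificial data, apply a selection criterion, measure the efficiency of the retained subset) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the thesis: the function names are hypothetical, the clustered data generator is a toy, the criterion shown (for each of the first q correlation PCs, keep the not-yet-selected variable with the largest absolute loading) is only one simple loading-based rule, and the efficiency measure (the proportion of total variation in X reproduced by linear regression on the retained variables) is one of several possible choices.

```python
import numpy as np


def make_clustered_data(n=200, seed=0):
    """Toy artificial data (assumption): two clusters of three correlated variables."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n, 2))           # two independent latent factors
    loadings = np.array([[1, 1, 1, 0, 0, 0],
                         [0, 0, 0, 1, 1, 1]], dtype=float)
    noise = 0.3 * rng.normal(size=(n, 6))       # variable-specific noise
    return factors @ loadings + noise


def select_by_loadings(X, q):
    """Illustrative criterion (assumption): one variable per leading correlation PC."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # PCs sorted by decreasing variance
    selected = []
    for j in order[:q]:
        # Rank variables by absolute loading on PC j, skip those already retained.
        for v in np.argsort(-np.abs(eigvecs[:, j])):
            if int(v) not in selected:
                selected.append(int(v))
                break
    return selected


def variance_explained(X, subset):
    """Share of total (standardised) variation reproduced by the retained subset."""
    R = np.corrcoef(X, rowvar=False)
    R_ps = R[:, subset]                         # p x q cross-correlations
    R_ss = R[np.ix_(subset, subset)]            # q x q correlations within the subset
    fitted = R_ps @ np.linalg.solve(R_ss, R_ps.T)
    return float(np.trace(fitted) / np.trace(R))


X = make_clustered_data()
subset = select_by_loadings(X, q=2)
efficiency = variance_explained(X, subset)
print("retained variables:", subset)
print("proportion of variation explained:", round(efficiency, 3))
```

Repeating this for different data structures and different criteria, and comparing the resulting efficiencies, mirrors the comparison outlined in the abstract; the exhaustive search over all subsets of size q mentioned there would replace `select_by_loadings` with a loop over every combination, which quickly becomes impractical as p grows.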
