Title:

Statistical analysis of crystallographic data

The Cambridge Structural Database (CSD) is a vast resource for crystallographic information. As of 1st January 2009 there are more than 469,611 crystal structures available in the CSD. This work is centred on dSNAP, a program developed at the University of Glasgow that uses statistical methods to sort fragments of molecules into groups with similar conformations. The aim of this work is to reduce the number of variables required to describe the geometry of the fragments mined from the CSD. To this end, the geometric definition employed by dSNAP was investigated. The default definition, total geometries, is made up of all angles and all distances, including all nonbonded distances and angles. This definition was compared with four others: all angles; all distances; bonded angles and distances; and bonded angles, distances and torsion angles. These comparisons show that nonbonded information is critical to the formation of groups of fragments with similar conformations. The remainder of this work focused on reducing the number of variables required to separate fragments with similar conformations into distinct groups. Initially, a method was developed to calculate the area of the triangle formed by three atoms of the fragment. This was employed systematically as a means of reducing the total number of variables required to describe the geometry of the fragments. Multivariate statistical methods, namely factor analysis and sparse principal components analysis, were also applied with the aim of reducing the number of variables in a systematic manner. Both methods were used to extract important variables from the default geometric definition, total geometries. The extracted variables were then used as input for dSNAP, and the results were compared with the original output.
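The triangle-area descriptor mentioned above can be illustrated with a minimal sketch. It assumes Cartesian atomic coordinates and computes the area for every triple of atoms using the cross product. The function names, the choice of all triples, and the four-atom coordinates are hypothetical and are not taken from dSNAP, which may select triangles differently.

```python
import math
from itertools import combinations


def triangle_area(a, b, c):
    """Area of the triangle spanned by three 3D points (half the cross-product norm)."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def triangle_descriptors(coords):
    """Map each triple of atom indices to the area of its triangle.

    Assumption: every triple is used; the original work may apply a
    different, more selective scheme.
    """
    return {
        triple: triangle_area(*(coords[i] for i in triple))
        for triple in combinations(range(len(coords)), 3)
    }


# Hypothetical four-atom fragment; coordinates are invented for illustration.
fragment = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0)]
areas = triangle_descriptors(fragment)
for triple, area in areas.items():
    print(triple, round(area, 4))
```

Each area condenses the three distances and three angles of one triangle into a single number, which is the sense in which such a descriptor can reduce the variable count.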
Biplots were used to visualise the variables describing the fragments. Biplots are multivariate analogues of scatter plots and show how the fragments are related to the variables describing them. Owing to the large number of variables that make up the definition, factor analysis was applied to extract the important variables before the biplot was calculated. The biplots give an overview of the correlation matrix, and using these plots it is possible to select the variables that influence the formation of clusters in dSNAP.
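As a rough illustration of how biplot coordinates relate fragments to variables, the sketch below computes them with NumPy. It uses a plain principal components decomposition of the standardised data, not the factor analysis step described above, and the function name and random data are hypothetical. Plotting is omitted: in a biplot, the rows of `scores` would be drawn as points and the rows of `loadings` as arrows.

```python
import numpy as np


def biplot_coordinates(X, n_components=2):
    """Return (scores, loadings) for a correlation-based PCA biplot.

    scores: one row per observation (fragment).
    loadings: one row per variable.
    """
    X = np.asarray(X, dtype=float)
    # Standardise each column so the decomposition reflects the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings


# Invented data: 30 "fragments" described by 4 variables,
# with variable 1 made strongly correlated with variable 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=30)

scores, loadings = biplot_coordinates(X)
print(scores.shape, loadings.shape)
```

Strongly correlated variables get nearly parallel arrows, which is how such a plot can hint at which variables carry redundant information.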
