Title: Clustering methodology for bivariate circular data with application to protein dihedral angles
Author: Abushilah, Samira Faisal Hathoot
ISNI: 0000 0004 7971 9703
Awarding Body: University of Leeds
Current Institution: University of Leeds
Date of Award: 2019
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Thesis embargoed until 01 Nov 2024
This thesis focusses on the development of statistical methodologies that can deal with bivariate circular data in the context of protein bioinformatics. Circular data differs from traditional linear data, and statistical methods that handle the unique nature of this type of data are relatively new and still under development. The circular data we focus on are the dihedral angles that describe the conformation of the protein backbone. There are many problems related to circular data, and in this research we focus on some of them. Although experimental biological techniques can determine the structure and function of a protein, such techniques are expensive and very time-consuming. Clustering of amino acids remains a challenging problem in protein bioinformatics; solving it can help to predict whether a substitution of one amino acid by another has an essential impact on the protein structure, and hence on its function. Various researchers have attempted to cluster amino acids using physical properties, but we regard this as suboptimal when protein structure and function are the main interest. Therefore, we firstly propose a novel methodology to cluster groups of bivariate circular data, and this is used to cluster the 20 amino acids by considering the dissimilarity in the bivariate distributions of their dihedral angles. This dissimilarity can be expressed as the p-value of a permutation test for any pair of amino acids, and we use this to obtain our own clusters. This clustering is then compared to other amino acid classifications using similarity indices. For large sample sizes, however, obtaining these p-values by a permutation test takes considerable computational time. Consequently, we secondly consider two novel homogeneity tests and develop distributional results based on theoretical asymptotic considerations on the distribution of the newly proposed test statistic.
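As a rough illustration of how a permutation-test p-value for a pair of bivariate circular samples might be computed, the sketch below pools two samples of (phi, psi) angles, reshuffles the group labels, and compares a test statistic against its permutation distribution. The abstract does not state which statistic the thesis uses, so the statistic here (the distance between mean embeddings of the angles in four dimensions) is purely an assumption chosen for simplicity, and all function names are hypothetical. It only detects differences in the first trigonometric moments.

```python
import math
import random

def embed(angles):
    """Map a (phi, psi) pair in radians to (cos phi, sin phi, cos psi, sin psi)."""
    phi, psi = angles
    return (math.cos(phi), math.sin(phi), math.cos(psi), math.sin(psi))

def mean_embedding(sample):
    """Average the 4-dimensional embeddings of all angle pairs in a sample."""
    points = [embed(a) for a in sample]
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(4))

def statistic(x, y):
    """Assumed test statistic: Euclidean distance between mean embeddings."""
    return math.dist(mean_embedding(x), mean_embedding(y))

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Estimate the p-value of the homogeneity hypothesis by relabelling."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        if statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            count += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (count + 1) / (n_perm + 1)
```

Each call costs `n_perm` evaluations of the statistic over the pooled sample, which is consistent with the computational burden the abstract mentions for large samples. A matrix of such pairwise p-values could then be passed to a clustering routine; how the thesis converts p-values into dissimilarities is not specified in this record.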
The properties and distributions of our parametric tests are investigated, and their performance is examined using simulated data (normal samples and von Mises samples). One of the tests is also applied to our real data, protein dihedral angles, for which clustering is carried out as before. It is also biologically important to know the properties of amino acids, since these characteristics affect the biological activity of a protein and its structure. In biochemistry, it is well known that the structure of some molecules, such as proteins, DNA and RNA, can be described in terms of conformational angles; for proteins, these angles can be dihedral angles. Since each amino acid corresponds to a pair of dihedral angles, the pattern of the dihedral angle distribution across proteins is one way to determine amino acid characteristics. Therefore, we thirdly develop an approach to kernel density estimation on the torus, and this is used to estimate the distribution of the dihedral angles that belong to each amino acid across proteins. An initial step requires a choice of two smoothing parameters, which we investigate. Then, the bivariate kernel density estimated under the two smoothing parameters can be processed using mathematical morphology to partition the sample space of densities. By using this methodology, a researcher can divide the bivariate circular data into groups without being given the number of clusters a priori.
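To make the idea of kernel density estimation on the torus with two smoothing parameters concrete, here is a minimal sketch that uses a product of two von Mises kernels, one per angle. The abstract does not say which kernel the thesis adopts, so the product von Mises form, the concentration parameters `kappa_phi` and `kappa_psi` (standing in for the two smoothing parameters), and the function names are all assumptions made for illustration.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function I0(x) from its power series."""
    total, term = 1.0, 1.0
    half_sq = (x / 2.0) ** 2
    for k in range(1, terms):
        term *= half_sq / (k * k)  # ratio of consecutive series terms
        total += term
    return total

def torus_kde(phi, psi, data, kappa_phi, kappa_psi):
    """Density estimate at (phi, psi) from a list of (phi_i, psi_i) pairs.

    Assumed kernel: product of von Mises densities; larger kappa values
    mean less smoothing in the corresponding angle.
    """
    norm = (2.0 * math.pi) ** 2 * bessel_i0(kappa_phi) * bessel_i0(kappa_psi)
    total = 0.0
    for a, b in data:
        total += math.exp(kappa_phi * math.cos(phi - a)
                          + kappa_psi * math.cos(psi - b))
    return total / (len(data) * norm)
```

Because the kernel depends on the angles only through cosines of differences, the estimate is periodic in both arguments and integrates to one over the torus. Evaluating it on a regular grid of (phi, psi) values would give the kind of density surface that a later partitioning step could operate on.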
Supervisor: Gusnanto, Arief; Taylor, Charles C.
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available