Title: Statistical shape analysis of large molecular data sets
Author: Hennessey, Anthony
ISNI: 0000 0004 7233 8481
Awarding Body: University of Nottingham
Current Institution: University of Nottingham
Date of Award: 2018
Protein classification databases are widely used in the prediction of protein structure and function, and amongst these databases the manually curated Structural Classification of Proteins database (SCOP) is considered to be a gold standard. In SCOP, functional relationships are described by hyperfamily and superfamily categories, and structural relationships are described by family, species and protein categories. We present a method to calculate a difference measure between pairs of proteins that can be used to reproduce SCOP structural relationship classifications, and that can also be used to reproduce a subset of functional relationship classifications at the superfamily level. Calculating the difference measure requires first finding the best correspondence between atoms in two protein configurations, a problem known as the unlabelled, partial matching problem. We consider this problem through a detailed analysis of the approach presented in Green and Mardia (2006). Using this analysis, and applying domain-specific constraints, we develop a new algorithm called GProtA for protein structure alignment. The proposed difference measure is constructed from the root mean squared deviation of the aligned protein structures and a binary similarity measure, where the binary similarity measure takes into account the proportions of atoms matching from each configuration. The GProtA algorithm and difference measure are applied to protein structure data taken from the Protein Data Bank. When clustered, the difference measure is shown to classify 63 of a set of 72 proteins into the correct SCOP family categories. Of the remaining 9 proteins, 2 are assigned incorrectly and 7 are considered indeterminate. In addition, a method for deriving characteristic signatures for categories is proposed. The signatures offer a mechanism by which a single comparison can be made to judge similarity to a particular category. Comparison using characteristic signatures is shown to correctly delineate proteins at the family level, including the identification of both families for a subset of proteins described by two family level categories.
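As a rough illustration of the two ingredients the abstract names, the sketch below computes a root mean squared deviation over matched atom pairs and the proportion of atoms matched from each configuration. It is not the GProtA algorithm or the thesis's difference measure: it assumes the two structures are already superimposed and that a correspondence (`pairs`) has already been found. The function names, the use of the smaller proportion as the similarity, and the additive combination with a `weight` parameter are all hypothetical choices made here for illustration, since the abstract does not state how the components are combined.

```python
import math


def matched_rmsd(coords_a, coords_b, pairs):
    """RMSD over matched atom pairs.

    Assumes the coordinates are already superimposed; `pairs` is a list of
    (index_in_a, index_in_b) tuples giving the correspondence.
    """
    if not pairs:
        raise ValueError("at least one matched pair is required")
    total = 0.0
    for i, j in pairs:
        # Squared Euclidean distance between the two matched atoms.
        total += sum((p - q) ** 2 for p, q in zip(coords_a[i], coords_b[j]))
    return math.sqrt(total / len(pairs))


def match_proportions(n_a, n_b, n_matched):
    """Proportion of atoms matched from each configuration."""
    return n_matched / n_a, n_matched / n_b


def illustrative_difference(coords_a, coords_b, pairs, weight=1.0):
    """Hypothetical combination of RMSD and match proportions.

    The similarity term (smaller of the two proportions) and the additive
    form are assumptions for illustration, not the thesis's definition.
    """
    rmsd = matched_rmsd(coords_a, coords_b, pairs)
    prop_a, prop_b = match_proportions(len(coords_a), len(coords_b), len(pairs))
    similarity = min(prop_a, prop_b)
    return rmsd + weight * (1.0 - similarity)
```

With this form, two structures that align closely but share only a small fraction of their atoms still receive a large difference, which is the role the abstract gives to the binary similarity term.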
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: QA75 Electronic computers. Computer science; QP501 Animal biochemistry