Title: Using sequence data to investigate the functional design of proteins
Author: Puszkarska, Anna
ISNI: 0000 0004 9353 9450
Awarding Body: University of Cambridge
Current Institution: University of Cambridge
Date of Award: 2020
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Thesis embargoed until 01 Jan 2400
Understanding which sequence features enable a given protein to perform a specific function is a long-standing challenge in molecular biology. Detecting functional specificity among paralogous proteins is particularly difficult: because paralogues share a common molecular origin, the variants differ only subtly, and these differences are hard to detect from observable characteristics. For example, a set of paralogous polypeptides present in the genomes of modern vertebrates gives rise to the variety of collagen proteins. These proteins preserve a high degree of sequence and structural homology, yet have been optimised by evolutionary processes to perform their specific functions, namely to assemble into widely diverse biological materials such as fibrils or networks. This thesis exploits statistical modelling of protein sequence data to shed light on the relationship between protein sequence and function. Specifically, it develops approaches to find the sequence design principles that determine the specificity of protein paralogues that have diverged in function. I show that this can be achieved by investigating evolutionary sequence variation within a probabilistic framework. A new approach for examining the importance of each amino acid in the protein primary sequence is developed. I find that the functionally important amino acids can be grouped into two clusters: (i) those shared among all paralogue sequences, which are responsible for common features, and (ii) those specific to each group, which enable functional specificity. I use data sets of orthologous collagen sequences from genomic research to build sequence models that represent each collagen paralogue variant, and use these models to carry out comparative analysis. Adaptational dependencies among seventeen types of α-paralogue sequences from two functional groups are analysed.
Moreover, a model of intermolecular interactions between fibrillar collagen trimers is proposed to show that the phenotype of the supramolecular fibrillar structure is fully encoded in the primary sequences of the collagen proteins and can be predicted purely from simple rules for the interaction between amino acid residues. Finally, I use statistical learning approaches to model the activity of peptide hormones, and use the resulting models to design novel hormone sequences with improved functional properties.
Supervisor: Colwell, Lucy ; Duer, Melinda
Sponsor: Raymond and Beverly Sackler Fund for Physics of Medicine ; University of Cambridge ; European Research Council ; Simons Foundation
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
Keywords: protein sequence analysis ; statistical modelling ; evolutionary sequence variation ; collagen