Comparative genome analysis to reveal protein evolution
The completion of a substantial number of genome sequencing initiatives has produced more than a million protein sequences. Recent advances in computing and bioinformatics techniques make the analysis of these sequences possible. This thesis describes a novel automated protein classification protocol that groups proteins into families and identifies protein domain architectures via domain assignment. This data is presented in the Gene3D database, which is used for subsequent analysis. Protein family and protein domain data show a power-law-like distribution of the kind seen in many biological data sets, which is indicative of the small-world networks that underlie biological systems. The kingdom distribution of superfamilies and protein families in Gene3D is used to describe the evolutionary mechanisms that determine genome diversity through protein diversity. Domain occurrence profiles are used to identify protein domain superfamilies whose occurrence is correlated with genome size in bacteria. These superfamilies are shown to exhibit a balance between metabolic and regulatory roles, following microeconomic principles, that may determine bacterial genome size. Domain families identified in Gene3D enable a determination of the total number of protein folds in nature. Sub-clustering of domain families permits occurrence profiles to be determined for domain family sub-clusters. These profiles are shown to be capable of detecting correlations and anti-correlations between domain families that are undetectable using superfamily occurrence profiles alone. Clusters of correlated domain sub-clusters are shown to identify functionally linked clusters of proteins. Finally, the data in Gene3D is used to functionally annotate the CATH database and to provide functional predictions for unannotated proteins, providing a more comprehensive functional repertoire and greater accuracy than other functional prediction methods.
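The idea of a domain occurrence profile can be illustrated with a minimal sketch. The abstract does not state which statistic the thesis uses, so the Pearson correlation coefficient is assumed here purely for illustration. The function names, superfamily labels, genome labels and counts are all hypothetical toy values and are not taken from Gene3D.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant profile carries no information about genome size.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlate_with_genome_size(profiles, genome_sizes):
    """Correlate each superfamily's occurrence profile with genome size.

    profiles: {superfamily: {genome: domain count}} (hypothetical layout)
    genome_sizes: {genome: number of genes}
    """
    genomes = sorted(genome_sizes)
    sizes = [genome_sizes[g] for g in genomes]
    return {
        superfamily: pearson([counts.get(g, 0) for g in genomes], sizes)
        for superfamily, counts in profiles.items()
    }


# Toy data, invented for illustration only.
genome_sizes = {"genome_A": 1000, "genome_B": 2000, "genome_C": 4000}
profiles = {
    "sf_regulatory": {"genome_A": 2, "genome_B": 4, "genome_C": 8},
    "sf_housekeeping": {"genome_A": 5, "genome_B": 5, "genome_C": 5},
    "sf_declining": {"genome_A": 7, "genome_B": 5, "genome_C": 1},
}

correlations = correlate_with_genome_size(profiles, genome_sizes)
for superfamily, r in sorted(correlations.items()):
    print(f"{superfamily}: r = {r:+.2f}")
```

In this toy setting a superfamily whose copy number scales with genome size gives a coefficient near +1, one that shrinks as genomes grow gives a coefficient near -1, and a constant profile is reported as 0. The same comparison applied between pairs of sub-cluster profiles, rather than against genome size, is one way correlations and anti-correlations between domain families could be looked for.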