A bioinformatics and molecular analysis of antigenic variation in African trypanosomes
The aim of this thesis was to further our knowledge of the contribution of silent alleles on megabase chromosomes to the late stages of trypanosome infection, and to test the hypothesis that this contribution takes shape as a hierarchy of expression arising from differences between alleles in both flanking regions and coding sequence. This was achieved through a combination of bioinformatics and molecular studies. The initial approach was to undertake an extensive manual curation of the available VSG archive; this endeavour resulted in the establishment of a fertile collaboration with the Trypanosoma brucei genome sequencing project and in the creation, with the aid of P. Ward and S. Menon, of a dedicated web-based tool to handle and query curated VSG genes. Out of an updated estimate of ~1600 VSG genes, 940 (between half and three quarters) were annotated and shown to be arranged in subtelomeric arrays and to be present largely as pseudogenes (~90%). When the hypervariable N-terminal domain (three types, A, B and C) and the more conserved C-terminal domain (types 1 to 4, with two additional types identified in this study) were considered separately, it appeared that most of the degeneracy lay in the C-terminal domain. This suggested that N-terminal domains (one third of them being intact) would be utilised by a process of segmental gene conversion, yielding hybrid genes by recombination with functional C-terminal ends resident at the expression site. Under the assumption that “order” within the genome (the presence of patterns within the VSG archive) helps inform “order” in VSG expression (a hierarchy based on different activation probabilities), it was somewhat surprising to detect little evidence of clear substructuring within the archive: no “classes” of VSGs could be identified on the basis of coding sequence and flanking sequence features.
In keeping with the observed high level of divergence within the VSG archive, clear orthologue groups (here defined as alleles sharing >60% amino acid sequence identity) were found to include no more than three to four members and to be scattered at random across the arrays. Putative functional genes could not be separated into groups based on features expected to alter activation probability, such as the number of upstream 70-bp repeats, which have been shown to be involved in copying silent alleles to the expression site.
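
As an illustration only, the >60% identity criterion used above to define groups can be sketched in a few lines of Python. This is not the procedure used in the thesis. It assumes that the protein sequences have already been aligned to equal length with "-" as the gap character, that identity is counted only over columns where neither sequence has a gap, and that groups are formed by single linkage (any pair above the threshold joins two groups). The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Fraction of identical residues over aligned columns without gaps.

    Assumption: `a` and `b` are rows of the same alignment (equal length,
    '-' marks a gap). Columns with a gap in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def group_by_identity(seqs, threshold=0.60):
    """Single-linkage grouping of sequences sharing > threshold identity.

    `seqs` maps a (hypothetical) gene name to its aligned sequence.
    Returns a sorted list of sorted name lists, one list per group.
    """
    parent = {name: name for name in seqs}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for p, q in combinations(seqs, 2):
        if percent_identity(seqs[p], seqs[q]) > threshold:
            parent[find(p)] = find(q)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Toy, made-up sequences: vsg_a and vsg_b share 8 of 10 residues,
# while vsg_c shares at most 5 of 10 with either of them.
toy = {
    "vsg_a": "MKTLLAVAAL",
    "vsg_b": "MKTLLAVGGL",
    "vsg_c": "MRSQQPVGGL",
}
groups = group_by_identity(toy)
```

With these toy inputs, vsg_a and vsg_b fall into one group and vsg_c remains a singleton. Single linkage is a permissive choice, and a stricter rule (for example, requiring every pair in a group to exceed the threshold) would yield groups no larger than these.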