Title: Clustering large raw DNA sequencing datasets by species of origin using signature features of genomic sequence composition
Author: Hodges, Tobias
ISNI: 0000 0004 2728 6993
Awarding Body: University of York
Current Institution: University of York
Date of Award: 2012
The establishment of high-throughput massively-parallel DNA sequencing technology has broadened the scope of metagenomics. The size and complexity of the datasets produced in such studies present considerable challenges. The aim of this project was to investigate the potential for genomic signature features to be applied to raw high-throughput sequencing reads generated from multi-species samples. Grouping reads according to the genome from which they originate could allow for the study of previously unknown or poorly understood pathogens, and improve the performance of assembly of genome sequences from these reads. Genomic signatures were compared to find the best feature or combination for grouping reads by species of origin. A range of datasets was developed to provide an effective basis for such analysis. The performance of a number of clustering methods was also compared. The accuracy of grouping that could be achieved was evaluated, and the effect of such a grouping on the performance of sequence assembly was assessed. It was found that perfect species-specific grouping of raw sequencing data was beyond the scope of the approaches assessed here, but the enrichment of groups for reads from particular species was achievable. The single greatest obstacle to effective grouping was thought to be the short length of reads produced by current sequencing platforms. The individual assembly of grouped reads was found to produce results similar to those from assembling the dataset as a whole but with a reduction in the time required. The future of DNA sequencing is bright, with technology advancing at a startling pace, providing improvements in read length, dataset size and experimental run-time. It is hoped that these advancements will prove beneficial to the approaches investigated here, which are likely to remain useful as the size and complexity of datasets increase.
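As a rough illustration of the general idea in the abstract (grouping reads by a composition-based signature), the sketch below computes normalized k-mer frequencies per read and clusters them with a small k-means loop. This record does not say which signature features or clustering methods the thesis used, so k-mer frequencies, k-means, and the names `kmer_signature`, `squared_distance` and `kmeans` are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_signature(read, k=4):
    """Normalized k-mer frequency vector of a read (assumed signature feature).

    Only k-mers made of A, C, G, T are counted; others (e.g. with N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20):
    """Very small k-means (assumed clustering method).

    Initialization is deterministic: the first n_clusters vectors are used
    as starting centroids, which is a simplification for this sketch.
    """
    centroids = [list(v) for v in vectors[:n_clusters]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy reads (invented): two AT-rich and two GC-rich sequences.
reads = [
    "ATATATATATATATAT",
    "GCGCGCGCGCGCGCGC",
    "ATATATATTATATATA",
    "GCGCGCGGCGCGCGCG",
]
labels = kmeans([kmer_signature(r, k=2) for r in reads], n_clusters=2)
```

The toy reads are far more distinct than real short reads from related genomes. The abstract itself notes that short read length was the main obstacle to effective grouping, so a sketch like this should not be read as indicating achievable accuracy.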
Supervisor: Ashton, Peter; Boonham, Neil
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available