Title: Bioinformatic analysis of genomic sequencing data : read alignment and variant evaluation
Author: Frousios, Kimon
ISNI: 0000 0004 5368 1209
Awarding Body: King's College London
Current Institution: King's College London (University of London)
Date of Award: 2014
The invention and growing popularity of Next Generation Sequencing (NGS) technologies have led to a steep increase in sequencing data and to new challenges. This thesis aims to contribute methods for the analysis of NGS data, and focuses on two of the challenges presented by these data. The first challenge concerns the need for NGS reads to be aligned to a reference sequence, as their short length complicates direct assembly. Many tools carry out this task quickly and efficiently, yet they all assess alignments by a mere count of mismatches, ignoring the known biases in genome composition and mutation frequencies. Thus, the use of a scoring matrix that incorporates the mutation and composition biases observed among humans was tested with simulated reads. The scoring matrix was implemented and incorporated into the in-house algorithm REAL, allowing side-by-side comparison of the performance of the biased model and the mismatch count. REAL was also used to investigate the applicability of NGS RNA-seq data to understanding the relationship between genomic expression and the compartmentalisation of genomic base composition into isochores.
The second challenge concerns the evaluation of the variants (SNPs) that are discovered by sequencing. NGS technologies have caused a sharp rise in the rate at which new SNPs are discovered, making experimental validation of each one impossible. Several tools take into account various properties of the genome, the transcripts and the protein products relevant to the location of a SNP and attempt to predict the SNP's impact. These tools are valuable in screening and prioritising SNPs likely to have a causative association with a genetic disease of interest. Despite the number of individual tools and the diversity of their resources, no attempt had been made to draw a consensus among them.
Two consensus approaches were considered: one based on a simple majority vote among the tools, and one based on machine learning. Both methods proved highly competitive in classification, both against the individual tools and against other consensus methods that were published in the meantime.
Supervisor: Schlitt, Thomas; Iliopoulos, Costas
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available