High-throughput image analysis for proteomics
The quest for high-throughput proteomics in recent years has revealed a number of critical issues. Whilst improvements in 2-D gel electrophoresis (2-DE) sample preparation, staining and imaging are being actively pursued by industry, reliable high-throughput spot matching and quantification remains a significant bottleneck in the bioinformatics pipeline. The flow of data to mass spectrometry through robotic spot excision and protein digestion has so far been restricted. The purpose of this thesis is to describe a computational framework that is suitable for large-scale image mining and statistical cross-validation of 2-DE. This is in response to the development of a statistical 2-DE ontology that underpins the emerging Human Proteome Organisation’s Proteomics Standards Initiative General Proteomics Standards (HUPO PSI GPS).
The thesis begins with a comprehensive review of the role of bioinformatics in 2-DE. A platform for Statistical Expression Analysis (SEA) is then proposed, which performs the analysis statistically and simultaneously across sets of gel replicates. The method avoids a drawback of traditional approaches, which reduce spots to a symbolic representation at a very early stage of the analysis; although this greatly simplifies data handling and management, it introduces persistent errors due to inaccuracies in spot modelling and matching. SEA is fully automated, and small expression changes that are insignificant in a single gel pair can be revealed when they are reinforced by consistent changes in the others. To achieve this, we present an integrated image registration and bias field correction technique that normalises sets of gels for direct comparison. The Robust Advanced Image Normalisation (RAIN) algorithm uses the image intensity distribution rather than selected features. A new method of volume-invariant warping is also proposed, which ensures that the volume of protein expression is preserved under transformation.
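The idea behind volume-invariant warping can be illustrated with a minimal one-dimensional sketch. It is not the thesis implementation: the function names, the synthetic Gaussian "spot" and the sinusoidal warp are all assumptions made for illustration, and plain Python is used only to keep the example self-contained. The sketch resamples a signal under a smooth warp and scales each output sample by the local stretch of the mapping (the 1-D Jacobian), so that the summed intensity is approximately preserved. A 2-D version would presumably scale by the determinant of the Jacobian matrix instead.

```python
import math

def interpolate(signal, position):
    """Linearly interpolate `signal` at a fractional index, clamped to its ends."""
    position = min(max(position, 0.0), len(signal) - 1.0)
    left = min(int(position), len(signal) - 2)
    frac = position - left
    return (1.0 - frac) * signal[left] + frac * signal[left + 1]

def warp(signal, mapping, preserve_volume=True):
    """Resample `signal` at the source coordinates listed in `mapping`.

    With preserve_volume=True each output sample is multiplied by the local
    stretch d(mapping)/dx, estimated by finite differences (the 1-D Jacobian).
    """
    n = len(mapping)
    warped = []
    for i in range(n):
        value = interpolate(signal, mapping[i])
        if preserve_volume:
            lo, hi = max(i - 1, 0), min(i + 1, n - 1)
            value *= (mapping[hi] - mapping[lo]) / (hi - lo)
        warped.append(value)
    return warped

# Hypothetical test data: a Gaussian "spot" and a smooth, monotonic warp
# that leaves both ends of the signal fixed.
n = 200
spot = [math.exp(-0.5 * ((x - 100.0) / 8.0) ** 2) for x in range(n)]
mapping = [x + 10.0 * math.sin(math.pi * x / (n - 1)) for x in range(n)]

naive = warp(spot, mapping, preserve_volume=False)
preserved = warp(spot, mapping)
print(sum(spot), sum(naive), sum(preserved))
```

In this sketch the naive resampling changes the total intensity of the spot because the warp locally compresses it, whereas the Jacobian-scaled version keeps the total close to the original.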
In this thesis, the relative bias field between a reference and a sample gel is modelled by multiplicative and additive piecewise B-spline surfaces. To overcome the massive computational burden that SEA entails, a cluster computing framework has been developed. It uses Condor middleware, JPEG-LS lossless image compression and probabilistic task replication in a novel distributed image processing scheme. In addition, the image analysis has been designed and implemented with GPGPU (general-purpose computation on graphics processing units), which brings a significant increase in computational performance. The proposed techniques were validated with 2-DE image data produced by University College Dublin, along with large datasets formed from the HUPO Human Brain Proteome Project (HBPP). The HBPP is a worldwide academic and industrial collaboration of neuroproteomic centres, which aims to characterise the human and mouse brain proteomes in order to study human neurodegenerative diseases. Compared with existing methods, results from the proposed analysis framework show substantial improvements in computational throughput and expression analysis sensitivity.
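The multiplicative and additive bias field model mentioned above can be sketched in one dimension as follows. This is an illustrative reduction under stated assumptions, not the thesis code: the function names `cubic_bspline` and `apply_bias`, the uniform knot spacing and the direction of the model (sample = gain × reference + offset) are all hypothetical choices. A surface model would presumably use a tensor product of such splines over the two image axes.

```python
def cubic_bspline(control, u):
    """Evaluate a uniform cubic B-spline at parameter u.

    `control` needs at least four points and 0 <= u <= len(control) - 3.
    """
    segment = min(int(u), len(control) - 4)
    t = u - segment
    weights = (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )
    return sum(w * c for w, c in zip(weights, control[segment:segment + 4]))

def apply_bias(reference, gain_ctrl, offset_ctrl):
    """Model a biased profile as gain(x) * reference(x) + offset(x).

    Both gain and offset are smooth B-spline curves spanning the profile;
    `reference` needs at least two samples.
    """
    n = len(reference)
    biased = []
    for i, value in enumerate(reference):
        x = i / (n - 1)  # normalised position in [0, 1]
        gain = cubic_bspline(gain_ctrl, x * (len(gain_ctrl) - 3))
        offset = cubic_bspline(offset_ctrl, x * (len(offset_ctrl) - 3))
        biased.append(gain * value + offset)
    return biased

# Hypothetical example: gain drifts smoothly upwards, offset is constant.
reference = [0.0, 1.0, 2.0, 1.0, 0.0]
biased = apply_bias(reference, [1.0, 1.0, 1.1, 1.2, 1.2], [0.05] * 5)
```

Using a small number of control points keeps both fields smooth, so the model can absorb slowly varying staining or illumination differences without fitting individual spots.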