Title: Fast static analysis for compile-time restructuring of application parallelism on Graphics Processing Units
Author: Stawinoga, Nicolai
ISNI: 0000 0004 8499 462X
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2019
Parallelism is everywhere, with co-processors such as Graphics Processing Units (GPUs) accelerating applications such as deep-learning neural network training, climate forecasting, bitcoin mining, medical imaging, and data analytics on platforms ranging from desktop computers to cloud computing, and from high-performance clusters to mobile phones. Code optimisations make it possible to realise the available performance of such devices, and automating these optimisations enables performance portability of software between different architectures. In this thesis, we consider two code optimisations that can improve application performance by reducing the degree of hardware and software parallelism in a program execution: thread coarsening, which reduces the number of threads launched by merging threads, and artificial occupancy reduction, which limits the number of threads processed simultaneously by allocating superfluous resources.
We show how occupancy prediction through re-compilation can enable the selection of near-optimal coarsening factors at compile-time, so that thread coarsening can be applied in a fully automated manner without requiring auto-tuning. We demonstrate that our approach can achieve a maximum speedup of 5.08x (1.30x average) across three different NVIDIA GPU architectures, two modes of coarsening, different problem sizes, and code pre-optimised to different degrees. When trying to predict the likely effects of thread coarsening, it is important to consider the effects it might have on cache pressure. We describe how a fast static analysis based on partial symbolic execution can be implemented to identify cache line re-use in programs. We demonstrate how this heuristic approach can improve on the runtime and memory requirements of a more extensive re-use distance analysis by several orders of magnitude, making it sufficiently light-weight for run-time execution. We show that the analysis is able to identify kernels that are likely to experience an increase in cache pressure after coarsening.
We explore the interaction of thread coarsening and artificial occupancy reduction, which can have negative effects on cache pressure and processor workload, respectively. We show that the two optimisation techniques can cancel out these negative effects when applied in combination, yielding a performance improvement of 8% in some cases. We investigate whether the cache line re-use analysis can identify candidates for artificial occupancy reduction.
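As an illustration of the thread coarsening idea described in the abstract, the following is a minimal sketch in plain Python that simulates a kernel launch. It is not taken from the thesis: the function names (run_kernel, vector_add, vector_add_coarsened, threads_needed), the vector-addition workload, and the strided index mapping are all assumptions chosen for illustration. Each coarsened thread does the work of several original threads, so fewer threads are launched while the result stays the same.

```python
def run_kernel(kernel, num_threads):
    """Simulate a GPU launch by invoking the kernel once per thread id."""
    for tid in range(num_threads):
        kernel(tid)


def vector_add(a, b, out):
    """Baseline kernel (hypothetical): one thread computes one element."""
    def kernel(tid):
        if tid < len(out):
            out[tid] = a[tid] + b[tid]
    return kernel


def vector_add_coarsened(a, b, out, factor, num_threads):
    """Coarsened kernel (hypothetical): one thread computes `factor` elements.

    Elements are assigned with a stride of `num_threads`, which is one
    possible mapping assumed here; other mappings are equally valid.
    """
    def kernel(tid):
        for k in range(factor):
            i = tid + k * num_threads
            if i < len(out):  # guard against running past the end
                out[i] = a[i] + b[i]
    return kernel


def threads_needed(n, factor):
    """Number of threads to launch for n elements at a given coarsening factor."""
    return (n + factor - 1) // factor  # ceiling division


n = 10
a = list(range(n))
b = [2 * x for x in a]

# Baseline launch: n threads, one element each.
baseline = [0] * n
run_kernel(vector_add(a, b, baseline), n)

# Coarsened launch: each thread handles up to `factor` elements.
factor = 4
coarse_threads = threads_needed(n, factor)
coarsened = [0] * n
run_kernel(vector_add_coarsened(a, b, coarsened, factor, coarse_threads),
           coarse_threads)

print(coarse_threads, coarsened == baseline)
```

In this sketch the coarsening factor is fixed by hand; the thesis concerns choosing such a factor automatically at compile-time, which the example does not attempt to reproduce.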
Supervisor: Field, Tony
Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral