Title: Use and examination of convolutional neural networks for scene understanding
Author: Jetley, Saumya
ISNI: 0000 0004 7971 5569
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2018
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS.
Access from Institution:
This thesis concerns itself with the use and examination of convolutional neural networks (CNNs) in the context of visual scene understanding. To this end, the first part of the thesis proposes novel extensions to vanilla CNNs that incorporate domain knowledge into their computational framework in order to better adapt them to targeted visual tasks. We begin by integrating a class-prototypical embedding space into a conventional classification network, whereby real-world object samples are recognised by matching them to the correct class visual prototypes in this space. This use of side information not only improves the network's classification performance on the categories seen during training but also boosts the recognition of similar categories that are unseen during training. Likewise, we propose a deep neural model for real-time instance segmentation that makes use of an intermediate shape embedding space. This continuous, learned latent space allows unseen input object images to be matched to new and realistic shape masks at test time. In follow-up work, we draw inspiration from recent advances in network design and training for object detection and segmentation; assimilating these techniques into our instance segmentation system further improves its accuracy while still operating in real time. In yet another application of CNNs, to the task of human saliency estimation, we revisit the interpretation of the task as a competitive process: humans look at some regions of an image at the cost of not looking at others. We thus model the output saliency maps as spatial probability distributions and propose training a deep network for the task with losses suited to measuring distances between probability distributions. This formulation yields significant gains in the network's predictive performance as measured on an array of saliency metrics.
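The distribution-based training described above can be pictured with a minimal sketch: the predicted saliency map is normalised into a spatial probability distribution, and a distance between distributions is used as the loss. The KL divergence shown here is one such distance (the thesis evaluates several); the function names and shapes are illustrative assumptions, not the thesis's exact implementation.

```python
import numpy as np

def spatial_softmax(logits):
    """Normalise a raw saliency map into a spatial probability distribution."""
    z = logits - logits.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_saliency_loss(pred_logits, target_dist, eps=1e-8):
    """KL divergence D(target || pred) between two spatial distributions."""
    p = spatial_softmax(pred_logits)
    q = target_dist / target_dist.sum()  # ensure the target sums to one
    return float(np.sum(q * np.log((q + eps) / (p + eps))))
```

When the predicted distribution matches the ground-truth fixation distribution, the loss vanishes; it grows as probability mass is placed on the wrong image regions.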
After augmenting and applying conventional CNNs to a variety of visual tasks, the later part of the thesis shifts its focus to an examination of these networks. We begin by investigating classification CNNs at a qualitative level by modelling network visual attention. In particular, we formulate attention modules that can be trained in tandem with the network weights to optimise the end goal of image classification. The resulting spatial attention scores, associated with the local features at predefined network layers, are able to identify the semantic parts of the input images. In other words, the attention maps suppress the irrelevant and highlight the relevant regions of the input images in a way that lends greater transparency to the networks' inference procedure and also boosts the output classification accuracy. Further, the binarised attention maps serve as useful weakly supervised foreground segmentation masks. We then perform a more principled analysis of the class decision functions learned by classification CNNs by contextualising an existing geometrical framework for network decision-boundary analysis. Our research uncovers some intriguing yet simplistic facets of the class score functions learned by these networks that explain their adversarial vulnerability. We find that specific input-image-space directions tend to be associated with fixed class identities: simply increasing the magnitude of the correlation between an input image and a single such direction causes the networks to believe that more (or less) of the class is present. This offers a new perspective on the existence of universal adversarial perturbations. Moreover, the input-image-space directions that the networks use to achieve their classification performance are the very ones along which they are most vulnerable to attack; the vulnerability arises from the rather simplistic non-linear use of these directions.
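One simple way to picture such a trainable attention module is as a compatibility score between each local feature and a global image descriptor, softmax-normalised over spatial positions to give attention weights. The dot-product compatibility and the names below are illustrative assumptions rather than the thesis's exact formulation.

```python
import numpy as np

def attention_pool(local_feats, global_feat):
    """
    local_feats: (N, D) array of local feature vectors at N spatial positions.
    global_feat: (D,) global image descriptor.
    Returns the attention weights over positions and the attention-pooled descriptor.
    """
    # Compatibility of each local feature with the global descriptor.
    scores = local_feats @ global_feat            # (N,)
    scores = scores - scores.max()                # stability before exponentiation
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over positions
    pooled = weights @ local_feats                # (D,) attention-weighted sum
    return weights, pooled
```

Positions whose features align with the global descriptor receive high weight, and thresholding (binarising) the weight map yields a coarse foreground mask of the kind the abstract mentions.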
Thus, as it stands, the performance and vulnerability of these networks are closely entwined. Various other notable observations emerge from our findings and are discussed in greater detail in the thesis. We conclude by highlighting some open questions in an effort to inform future work in the field.
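The direction-based view of vulnerability can be made concrete with a toy linear model (a deliberate simplification, not the thesis's analysis): when a class score is simply the correlation of the input with a fixed image-space direction, one perturbation along that direction shifts the score by the same amount for every input, mirroring the behaviour of universal adversarial perturbations.

```python
import numpy as np

def class_score(x, w):
    """Toy linear class score: correlation of input x with direction w."""
    return float(x @ w)

rng = np.random.default_rng(0)
w = rng.normal(size=64)
w /= np.linalg.norm(w)            # a fixed, unit-norm image-space direction
xs = rng.normal(size=(100, 64))   # a batch of toy "images"

# A single universal perturbation along w raises the class score for every input.
eps = 5.0
delta = eps * w
boosts = [class_score(x + delta, w) - class_score(x, w) for x in xs]
```

Because the score is linear in the input, every boost equals exactly eps, independent of the image; in real CNNs the dependence is non-linear, but the abstract's point is that it remains simplistic enough for such shared directions to exist.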
Supervisor: Torr, Philip
Sponsor: HELIOS (FP7-IDEAS-ERC)
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Neural networks ; Computer vision--Research ; Computer vision--Mathematical models ; Machine learning ; Computer vision ; Convolutional neural networks