Title: Learning shape from images
Author: Wiles, Olivia
ISNI:       0000 0004 9355 2161
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2020
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS.
Access from Institution:
This thesis explores how to harness neural networks to learn 3D structure from visual data. Estimating 3D structure is of vital importance in many applications, such as VR/AR, medical imaging, and computational photography. Recently, neural networks have been shown to be effective at learning complex functions from data, including in a 3D setting. The central question this thesis addresses is how to overcome the limitations inherent in strongly supervised training of neural networks on given 3D information. We explore three tasks that require 3D understanding or form a step in a larger 3D system: (i) correspondence estimation, (ii) single-image 3D prediction, and (iii) view synthesis. A key tool used throughout this thesis is self-supervision: the core idea, when learning about 3D structure, is to train models on varied and challenging datasets without requiring knowledge of the true, underlying 3D structure.

The first task considered is how to improve correspondence estimation between images. Instead of following standard approaches that condition descriptors on a single image, we use efficient, complex neural networks to condition the descriptors on both images, achieving state-of-the-art or comparable results.

The second task, to which we devote the majority of this thesis, is how to estimate and represent 3D structure from a single image without requiring knowledge of the true 3D structure. We explore three methods for predicting 3D information using a self-supervised training approach; these methods vary in terms of domain, self-supervised setup, and 3D representation. We first use silhouettes and depth to train an embedding that represents 3D structure on a dataset of realistic sculptures, and that can incorporate input views which are not photometrically consistent. Incorporating images of the same object rendered in different materials is not feasible with traditional approaches.
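The silhouette-based self-supervision mentioned above can be illustrated with a minimal sketch: project a predicted 3D occupancy volume into 2D and penalise disagreement with an observed silhouette mask. The NumPy snippet below is a toy illustration only (orthographic projection along one axis, an IoU-style loss, and hypothetical function names); it is not the architecture or loss used in the thesis.

```python
import numpy as np

def project_silhouette(voxels, axis=2):
    """Orthographic projection of a binary occupancy grid to a 2D silhouette."""
    return voxels.max(axis=axis)

def silhouette_iou_loss(pred_voxels, target_sil, axis=2):
    """1 - IoU between the projected prediction and an observed silhouette mask."""
    pred_sil = project_silhouette(pred_voxels, axis=axis)
    inter = np.logical_and(pred_sil, target_sil).sum()
    union = np.logical_or(pred_sil, target_sil).sum()
    return 1.0 - inter / max(union, 1)

# Toy example: a predicted volume occupying one corner of a 4x4x4 grid,
# compared against a matching observed silhouette.
vox = np.zeros((4, 4, 4), dtype=bool)
vox[:2, :2, :] = True
target = np.zeros((4, 4), dtype=bool)
target[:2, :2] = True
loss = silhouette_iou_loss(vox, target)  # -> 0.0 (perfect overlap)
```

Because the supervision signal is only the silhouette, views of the same object in different materials can all contribute, which is exactly what photometric-consistency-based losses cannot handle.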
We then use geometric correspondences to train a single-image depth predictor on a dataset of real images of sculptures. Training such a predictor is challenging with standard methods, as the images vary dramatically in context (for example, whether taken by day or night, or in autumn or winter), illumination, and pose. Finally, we use videos of people (or another object class) moving over time to learn class-based embeddings that encode the pose and structure of instances of that class. The embedding is then used to predict object-specific properties (such as landmarks or expression) and achieves high-quality results for a variety of object classes, including humans, faces, and animals.

The final task focuses on applying self-supervision to a downstream task, view synthesis, which requires implicitly understanding 3D structure. Given an image of a scene or a face, the task is to generate a new view of that scene or face. For a scene, we generate new images according to an input viewpoint change. For faces, we generate new images by manipulating the pose or expression of the face according to another input face or another modality (such as pose or audio). These self-supervised methods achieve high-quality results on challenging domains by implicitly learning about the 3D structure of a scene, without knowing the true 3D structure.
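A common self-supervised signal underlying such depth-prediction and view-synthesis methods is a photometric reconstruction loss: warp one view toward another using predicted geometry and compare the result pixel-by-pixel, so no ground-truth 3D is ever needed. The NumPy sketch below shows the idea under strong simplifying assumptions (integer horizontal disparities, nearest-neighbour sampling, hypothetical function names); real systems use differentiable bilinear sampling and learned depth networks.

```python
import numpy as np

def warp_horizontal(img, disparity):
    """Backward-warp img by a per-pixel integer horizontal disparity
    (nearest-neighbour sampling, edge pixels clamped)."""
    h, w = img.shape[:2]
    xs = np.arange(w)[None, :] - disparity     # source column for each target pixel
    xs = np.clip(xs, 0, w - 1).astype(int)
    rows = np.arange(h)[:, None]
    return img[rows, xs]

def photometric_loss(pred, target):
    """Mean absolute photometric error between a warped view and the real view."""
    return np.abs(pred.astype(float) - target.astype(float)).mean()

# Toy usage: a horizontal-gradient image, "predicted" to shift one pixel right.
left = np.tile(np.arange(8), (4, 1))
disp = np.ones_like(left)
recon = warp_horizontal(left, disp)
```

Minimising this loss over many image pairs forces the network's geometry predictions to be good enough to explain the pixels, which is how 3D structure is learned implicitly.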
Supervisor: Not available
Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Computer Vision ; Machine Learning