Title: Deep learning with synthetic, temporal, and adversarial supervision
Author: Gupta, Ankush
ISNI: 0000 0004 7960 0067
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2018
Availability of Full Text: Full text unavailable from EThOS; access may be available via the awarding institution.
In this thesis we explore alternatives to manually annotated training examples for supervising the training of deep learning models. Specifically, we develop methods for learning under three different supervision paradigms: (1) synthetic data, (2) temporal data, and (3) adversarial supervision for learning from unaligned examples. The dominant application domain of our work is text spotting, i.e. the detection and recognition of text instances in images. We learn text localisation networks on synthetic data, and harness an adversarial discriminator to train text recognition networks using no paired training examples. Further, we exploit the changing pose of objects in temporal sequences (videos) to learn object landmark detectors. The unifying objective is to scale deep learning methods beyond manually annotated training data.

We develop a large-scale, realistic synthetic scene-text dataset. Armed with this large annotated dataset of scene images, we train a novel, fast, fully-convolutional text detection network, and show excellent performance on real images. This generalisation from synthetic to real images confirms the verisimilitude of our rendering process. The dataset, SynthText in the Wild, has been widely adopted by the research community, and has enabled the development of end-to-end text spotting models.

While synthetic text can be readily generated, it needs to be adapted to the specific application domain. However, unaligned examples of text images and valid language sentences are abundant. With this in mind, we develop a method for text recognition which learns from such unaligned data. We cast the text recognition problem as one of aligning the conditional distribution of strings predicted from given text images with lexically valid strings. This alignment is induced through an adversarial discriminator which tries to tell the predicted and real text strings apart.
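To illustrate the role of the discriminator in this adversarial alignment, the following is a minimal toy sketch (not the thesis's implementation, which uses neural networks): a logistic-regression discriminator over character-bigram counts is trained to separate lexically valid strings from gibberish strings standing in for the outputs of an untrained recogniser. All names and the tiny datasets here are illustrative assumptions.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
IDX = {c: i for i, c in enumerate(ALPHABET)}
DIM = len(ALPHABET) ** 2  # one feature per character bigram

def bigram_features(s):
    """Bag-of-bigrams feature vector for a lowercase string."""
    v = np.zeros(DIM)
    for a, b in zip(s, s[1:]):
        v[IDX[a] * len(ALPHABET) + IDX[b]] += 1.0
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# "Real" side: lexically valid strings. "Fake" side stands in for
# strings decoded from text images by an as-yet-untrained recogniser.
real = ["there", "about", "which", "string", "learning", "vision"]
fake = ["xqzjv", "kkwpt", "zzxqa", "vvjqk", "qqqzx", "jjxzw"]

X = np.array([bigram_features(s) for s in real + fake])
y = np.array([1.0] * len(real) + [0.0] * len(fake))

# Discriminator: logistic regression trained by full-batch gradient
# descent on the usual binary cross-entropy objective.
w = np.zeros(DIM)
b = 0.0
lr = 0.5
for _ in range(200):
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

def d_score(s):
    """Probability the discriminator assigns to s being a valid string."""
    return float(sigmoid(bigram_features(s) @ w + b))
```

In the full adversarial setting, the recogniser would then be updated to *raise* `d_score` on its predicted strings, pushing its output distribution towards lexically valid text; here only the discriminator half is shown.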
Our method achieves excellent text recognition accuracy using no labelled training examples.

Temporal sequences (videos) of objects encode changes in their pose. We develop a method to harness this and learn object landmark detectors which consistently track object parts across different poses and instances. We achieve this by conditionally generating a future frame given a past frame and a sparse, keypoint-like (learnt) representation extracted from the future frame. We demonstrate the generality of our method by learning landmarks for human faces (where we outperform existing landmark detectors), articulated human bodies, and rigid 3D objects, with no modification to the method.

Finally, we propose one-step inductive training for improving the generalisation of recurrent neural networks to longer sequences. We restrict the recurrent state to a spatial memory map which tracks the regions of the image that have been accounted for, and train the network for valid evolution of this map. We show excellent generalisation to much longer sequences on two sequential visual recognition tasks: joint localisation and recognition of multiple lines of text, and counting objects in aerial images.
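The sparse, keypoint-like representation above is commonly extracted from a network's score maps with a spatial softmax followed by an expected-coordinate computation (a "soft-argmax"), which keeps keypoint coordinates differentiable. The sketch below shows only this standard extraction step in numpy; it is an illustrative assumption, not the thesis's actual architecture.

```python
import numpy as np

def soft_argmax(heatmap):
    """Differentiable keypoint coordinate from a 2-D score map:
    spatial softmax over all locations, then the expected (y, x)
    position under that distribution."""
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())  # shift for numerical stability
    p /= p.sum()                         # softmax over the whole map
    ys, xs = np.mgrid[0:h, 0:w]
    return float((p * ys).sum()), float((p * xs).sum())
```

A sharply peaked map yields the peak's coordinates, while a diffuse map yields a blended position; passing only these few coordinates to the frame generator forms the information bottleneck that forces them to behave like landmarks.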
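The intuition behind restricting the recurrent state to a spatial memory map can be sketched with a toy sequential reader: at every step it attends to the topmost not-yet-explained line of "text" pixels, emits it, and marks it in the map. Because the only state carried between steps is the map itself (not a step-count-dependent hidden vector), the loop behaves identically however many lines the image contains. This is a hand-written illustration of the inductive idea, not the trained network from the thesis.

```python
import numpy as np

def read_lines(text_mask, line_height):
    """Toy sequential reader over a binary image.

    text_mask: 2-D 0/1 array marking text pixels.
    line_height: number of rows each emitted line accounts for.
    Returns the top-row index of each line in reading order, plus
    the final spatial memory map of regions accounted for.
    """
    memory = np.zeros_like(text_mask)       # spatial memory map
    outputs = []
    while True:
        remaining = text_mask * (1 - memory)  # mask out explained regions
        rows = remaining.sum(axis=1)
        if rows.sum() == 0:                   # everything accounted for
            break
        top = int(np.argmax(rows > 0))        # topmost uncovered text row
        outputs.append(top)
        memory[top:top + line_height] = 1     # valid evolution of the map
    return outputs, memory
```

Running this on an image with three lines or with thirty produces the same per-step behaviour, which is exactly the kind of length generalisation the one-step inductive training aims for.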
Supervisor: Vedaldi, Andrea; Zisserman, Andrew
Sponsor: Engineering and Physical Sciences Research Council (EPSRC)
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available