Use this URL to cite or link to this record in EThOS:
Title: Deep learning for text spotting
Author: Jaderberg, Maxwell
ISNI:       0000 0004 5359 5108
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2015
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
This thesis addresses the problem of text spotting - being able to automatically detect and recognise text in natural images. Developing text spotting systems, systems capable of reading and therefore better interpreting the visual world, is a challenging but wildly useful task to solve. We approach this problem by drawing on the successful developments in machine learning, in particular deep learning and neural networks, to present advancements using these data-driven methods. Deep learning based models, consisting of millions of trainable parameters, require a lot of data to train effectively. To meet the requirements of these data hungry algorithms, we present two methods of automatically generating extra training data without any additional human interaction. The first crawls a photo sharing website and uses a weakly-supervised existing text spotting system to harvest new data. The second is a synthetic data generation engine, capable of generating unlimited amounts of realistic looking text images, that can be solely relied upon for training text recognition models. While we define these new datasets, all our methods are also evaluated on standard public benchmark datasets. We develop two approaches to text spotting: character-centric and word-centric. In the character-centric approach, multiple character classifier models are developed, reinforcing each other through a feature sharing framework. These character models are used to generate text saliency maps to drive detection, and convolved with detection regions to enable text recognition, producing an end-to-end system with state-of-the-art performance. For the second, higher-level, word-centric approach to text spotting, weak detection models are constructed to find potential instances of words in images, which are subsequently refined and adjusted with a classifier and deep coordinate regressor. A whole word image recognition model recognises words from a huge dictionary of 90k words using classification, resulting in previously unattainable levels of accuracy. The resulting end-to-end text spotting pipeline advances the state of the art significantly and is applied to large scale video search. While dictionary based text recognition is useful and powerful, the need for unconstrained text recognition still prevails. We develop a two-part model for text recognition, with the complementary parts combined in a graphical model and trained using a structured output learning framework adapted to deep learning. The trained recognition model is capable of accurately recognising unseen and completely random text. Finally, we make a general contribution to improve the efficiency of convolutional neural networks. Our low-rank approximation schemes can be utilised to greatly reduce the number of computations required for inference. These are applied to various existing models, resulting in real-world speedups with negligible loss in predictive power.
Supervisor: Zisserman, Andrew; Vedaldi, Andrea Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Image understanding ; Information engineering ; machine learning ; deep learning ; neural networks ; text recognition ; text spotting ; text detection