Title: Delving deep into fine-grained sketch-based image retrieval
Author: Pang, Kaiyue
ISNI: 0000 0004 9355 663X
Awarding Body: Queen Mary University of London
Current Institution: Queen Mary, University of London
Date of Award: 2020
To see is to sketch. Since prehistoric times, people have used sketch-like petroglyphs as an effective communicative tool, one that predates the appearance of language by tens of thousands of years. This is even more true today: with the ubiquitous proliferation of touchscreen devices, sketching is possibly the only rendering mechanism readily available for all to express visual intentions. The intriguing free-hand property of human sketches, however, becomes a major obstacle in practical applications: humans are not faithful artists, and the sketches they draw are iconic abstractions of mental images that can quickly fall off the visual manifold of natural objects. The problem of discriminatively matching such sketches with their corresponding photos is known as fine-grained sketch-based image retrieval (FG-SBIR), and it has drawn increasing interest due to its potential for commercial adoption. This thesis delves deep into FG-SBIR by analysing the intrinsic, unique traits of human sketches and leveraging that understanding to strengthen their matching with photos under deep learning. More specifically, this thesis investigates and develops four methods for FG-SBIR, as follows. Chapter 3 describes a discriminative-generative hybrid method to better bridge the domain gap between photo and sketch. Existing FG-SBIR models learn a deep joint embedding space with discriminative losses only, pulling matching pairs of photos and sketches close and pushing mismatched pairs apart, and thus align the two domains only indirectly. To address this, we introduce a generative task of cross-domain image synthesis. Concretely, when an input photo is embedded in the joint space, the embedding vector is used as input to a generative model to synthesise the corresponding sketch. This task forces the learned embedding space to preserve all the domain-invariant information that is useful for cross-domain reconstruction, thus explicitly reducing the domain gap, in contrast to existing models.
Such an approach achieves the first near-human performance on the largest FG-SBIR dataset to date, Sketchy. Chapter 4 presents a new way of modelling human sketches and shows how such modelling can be integrated into the existing FG-SBIR paradigm with promising performance. Instead of modelling the forward sketching pass, we attempt to invert it. We model this inversion by translating iconic free-hand sketches into contours that more closely resemble geometrically realistic projections of object boundaries, and by separately factorising out the salient added details. This factorised re-representation enables more effective sketch-photo matching. Specifically, we propose a novel unsupervised image style transfer model based on enforcing a cyclic embedding consistency constraint. A deep four-way Siamese model is then formulated to exploit the synthesised contours by extracting distinct, complementary detail features for FG-SBIR. Chapter 5 extends the practical applicability of FG-SBIR to work well beyond its training categories. Existing models, while successful, require instance-level pairing within each coarse-grained category as annotated training data, leaving their ability to deal with out-of-sample data unknown. We identify cross-category generalisation for FG-SBIR as a domain generalisation problem and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic: the novel sketch is embedded in the manifold, and the representation and retrieval function are updated accordingly.
Chapter 6 challenges the ImageNet pre-training that has long been considered crucial by the FG-SBIR community due to the lack of large sketch-photo paired datasets for FG-SBIR training, and proposes a self-supervised alternative for representation pre-training. Specifically, we consider the jigsaw puzzle game of recomposing images from shuffled parts. We identify two key facets of jigsaw task design that are required for effective performance. The first is formulating the puzzle in a mixed-modality fashion. Second, we show that framing the optimisation as permutation matrix inference via Sinkhorn iterations is more effective than the existing classifier instantiation of the jigsaw idea. We show for the first time that ImageNet classification is unnecessary as a pre-training strategy for FG-SBIR and confirm the efficacy of our jigsaw approach.
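For readers unfamiliar with the term, the Sinkhorn iterations mentioned for Chapter 6 can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `sinkhorn` and `hard_assignment`, the `temperature` parameter, and the toy score matrix are all assumptions made for illustration. The idea shown is only the generic one: a square matrix of patch-to-position scores is exponentiated and then alternately row- and column-normalised, which pushes it towards a doubly stochastic matrix, a soft relaxation of a permutation matrix.

```python
import math


def sinkhorn(scores, n_iters=50, temperature=1.0):
    """Alternately normalise rows and columns of exp(scores / temperature).

    Illustrative only: `scores` is a square list of lists where entry [i][j]
    is a (hypothetical) score for placing shuffled patch i at position j.
    """
    n = len(scores)
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    m = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def hard_assignment(soft):
    """Row-wise argmax of the soft matrix.

    Assumption: this is a valid permutation only when the soft matrix is
    already close to one; in general an assignment solver would be needed.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in soft]


# Toy example: patch 0 belongs at position 0, patch 1 at 2, patch 2 at 1.
toy_scores = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]
soft = sinkhorn(toy_scores)
print(hard_assignment(soft))  # prints [0, 2, 1]
```

Because every step is a differentiable normalisation, a relaxation of this kind can be trained end to end, whereas treating each permutation as a separate class requires restricting the puzzle to a fixed set of permutations.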
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available