Title: Visual retrieval for compound queries
Author: Zhong, Yujie
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2018
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS.
Access from Institution:
There are now vast numbers of images available on the Internet and in personal collections. As a result, searching for images based on their visual content has many useful applications. In this thesis, we focus on compound query image retrieval: the query can consist of objects of different types, a set of objects of the same type, or multiple examples of a single object, and the goal is to retrieve, based on visual content, images that match the query from a large image corpus. Compound query retrieval is very challenging, as the system may need to handle queries spanning different object types, and retrieval should be both accurate and real-time.

The first task we consider is retrieving images that contain both a target person and a target scene type from a large dataset. We propose a hybrid convolutional neural network architecture that produces place descriptors which are aware of faces and their corresponding face descriptors. We also propose an image synthesis system that renders high-quality, fully labelled face-and-place images, which are used to train the network. To facilitate this research, we collect and annotate a dataset of real images containing celebrities in different places, which can be used to evaluate the retrieval system. The new face-aware place descriptors yield significantly improved retrieval performance for compound queries compared to baseline methods.

Set retrieval is another example of compound query retrieval: given a set of query identities, we wish to rank images such that those containing all the query identities are ranked first, followed by those satisfying all but one of the identities, and so on. To this end, we propose a network architecture that learns face descriptors and their aggregation over a set, producing a compact fixed-length descriptor designed for set retrieval. We also explore the speed versus retrieval-quality trade-off of this compact descriptor. For evaluation, we collect and annotate a large dataset of images containing varying numbers of celebrities, which we make publicly available.

Template-based face recognition, where a set of faces of the same subject is available, is gaining attention, as real-world situations usually provide more than one example per subject. To tackle this problem, we propose a network architecture that aggregates and embeds the face descriptors produced by deep convolutional neural networks into a compact template representation. This representation requires minimal memory and enables efficient similarity computation. The architecture contains a novel GhostVLAD layer that enables the network to deal with poor-quality images: informative images contribute more to the template representation than low-quality ones. We show that this quality weighting of the input faces emerges automatically during training. The network far exceeds the state of the art on one of the most challenging public benchmarks.
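The set-retrieval objective (images matching all query identities first, then all but one, and so on) can be illustrated with a simple per-identity matching baseline. This is a hypothetical sketch, not the thesis's compact fixed-length set descriptor; the similarity threshold and the per-image face lists are assumptions for illustration:

```python
import numpy as np

def set_retrieval_rank(query_descs, image_face_descs, thresh=0.7):
    """Rank images for a multi-identity query (illustrative baseline only).

    query_descs: list of (D,) L2-normalised descriptors, one per query identity.
    image_face_descs: list over images; each item is an (M_i, D) array of
        L2-normalised face descriptors detected in that image (M_i may be 0).
    Returns image indices sorted so that images matching all query identities
    come first, then those matching all but one, and so on.
    """
    scores = []
    for faces in image_face_descs:
        matched = 0
        for q in query_descs:
            # an identity counts as present if any detected face is
            # similar enough to the query descriptor (cosine similarity)
            if faces.shape[0] and (faces @ q).max() >= thresh:
                matched += 1
        scores.append(matched)
    # stable sort by number of matched identities, descending
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Unlike this baseline, which must score every face in every image at query time, the thesis's approach compresses each image's set of faces into one fixed-length descriptor, trading some accuracy for much faster search.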
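The GhostVLAD layer described above extends VLAD-style aggregation with extra "ghost" clusters that receive soft assignments but are excluded from the output, so low-quality faces can shed their assignment mass into clusters that are simply discarded. A minimal NumPy sketch of such a forward pass, assuming a distance-based soft-assignment with a temperature `alpha` (the thesis's trained layer computes assignments with learned convolutional weights, so this is an approximation):

```python
import numpy as np

def ghostvlad(features, centers, n_ghost, alpha=1.0):
    """GhostVLAD-style aggregation (sketch).

    features: (N, D) face descriptors belonging to one template.
    centers:  (K + n_ghost, D) cluster centres; the last n_ghost rows
              are "ghost" clusters that absorb assignments but are
              dropped from the output.
    Returns a (K * D,) L2-normalised template descriptor.
    """
    K = centers.shape[0] - n_ghost
    # soft-assign each descriptor to all K + n_ghost clusters
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K+G)
    a = np.exp(-alpha * d2)
    a = a / a.sum(axis=1, keepdims=True)                              # rows sum to 1
    # accumulate weighted residuals for the K real clusters only;
    # mass assigned to ghost clusters is silently discarded
    vlad = np.zeros((K, centers.shape[1]))
    for k in range(K):
        resid = features - centers[k]                                 # (N, D)
        vlad[k] = (a[:, k:k + 1] * resid).sum(axis=0)
    # intra-normalise per cluster, then L2-normalise the flattened vector
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    out = vlad.ravel()
    return out / (np.linalg.norm(out) + 1e-12)
```

Because each face's assignment weights sum to one across real and ghost clusters together, a face that lands mostly in ghost clusters contributes little to the final descriptor, which is the mechanism behind the automatic quality weighting noted in the abstract.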
Supervisor: Zisserman, Andrew ; Arandjelović, Relja
Sponsor: Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available