Title: Features and methods for improving large scale face recognition
Author: Parkhi, Omkar Moreshwar
ISNI: 0000 0004 6062 3314
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2015
Availability of Full Text: Full text unavailable from EThOS.
This thesis investigates vector representations for face recognition and uses these representations for a number of tasks on image and video datasets. First, we look at different representations for faces in images and videos. The objective is to learn compact yet effective representations for describing faces. We first investigate the use of "Fisher Vector" descriptors for this task and show that these descriptors are perfectly suited for face representation tasks. We also investigate various approaches to reducing their dimension while further improving their performance. These "Fisher Vector" features are also amenable to extreme compression: they work equally well when compressed by over 2000 times relative to their uncompressed counterparts. These features achieved state-of-the-art results on challenging public benchmarks until the re-introduction of Convolutional Neural Networks (CNNs) in the community. Second, we investigate the use of "Very Deep" architectures for face representation tasks. For training these networks, we collected one of the largest annotated public datasets of celebrity faces with minimal manual intervention. We bring out the specific details of these network architectures and their training objective functions that are essential to their performance, and achieve state-of-the-art results on challenging datasets. Having developed these representations, we propose a method for labeling faces in the challenging environment of broadcast videos using their associated textual data, such as subtitles and transcripts. We show that our CNN representation is well suited for this task. We also propose a scheme to automatically differentiate the primary cast of a TV serial or movie from the background characters.
We modify existing methods of collecting supervision from textual data and show that careful alignment of video and textual data significantly increases the amount of training data collected automatically, which has a direct positive impact on the performance of labeling mechanisms. We provide extensive evaluations on different benchmark datasets, again achieving state-of-the-art results. Further, we show that both the shallow and the deep methods have excellent capabilities in switching modalities from photos to paintings and vice versa. We propose a system that, given a picture, retrieves paintings of similar-looking people, and we investigate the use of facial attributes for this task. Finally, we show that an on-the-fly, real-time search system can be built to search through thousands of hours of video data starting from a text query. We propose product quantization schemes for making face representations memory efficient. We also present a demo system based on this design, built for the British Broadcasting Corporation (BBC) to search through their archive. All of these contributions have been designed with a keen eye on their application in the real world. As a result, most chapters have an associated code release and a working online demonstration.
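Illustrative note (not part of the thesis record): the abstract mentions product quantization as a way of making face representations memory efficient. The following is a minimal, generic sketch of that technique in plain Python, assuming random toy vectors in place of real face descriptors. The function names (`train_pq`, `encode`, `decode`) and the parameters `n_sub` and `k` are hypothetical and chosen for illustration only; the record does not describe the specific schemes used in the thesis.

```python
import random


def _sqdist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _kmeans(points, k, iters, rng):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _sqdist(p, centroids[c]))
            buckets[nearest].append(p)
        for j, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def train_pq(vectors, n_sub, k, iters=10, seed=0):
    """Learn one k-entry codebook per sub-vector (hypothetical helper)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    assert dim % n_sub == 0, "dimension must split evenly into sub-vectors"
    step = dim // n_sub
    return [
        _kmeans([v[s * step:(s + 1) * step] for v in vectors], k, iters, rng)
        for s in range(n_sub)
    ]


def encode(vector, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    step = len(vector) // len(codebooks)
    return [
        min(range(len(cb)), key=lambda c: _sqdist(vector[s * step:(s + 1) * step], cb[c]))
        for s, cb in enumerate(codebooks)
    ]


def decode(code, codebooks):
    """Approximate the original vector by concatenating the chosen centroids."""
    out = []
    for idx, cb in zip(code, codebooks):
        out.extend(cb[idx])
    return out


# Toy data standing in for face descriptors (an assumption for illustration).
_rng = random.Random(1)
data = [[_rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
books = train_pq(data, n_sub=4, k=16)
code = encode(data[0], books)   # four small integers instead of eight floats
approx = decode(code, books)    # lossy reconstruction of data[0]
```

In this sketch each vector is stored as `n_sub` codebook indices rather than as a full list of floats, which is where the memory saving comes from; the codebooks themselves are shared across all stored vectors.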
Supervisor: Vedaldi, Andrea ; Zisserman, Andrew
Sponsor: EU ; IARPA ; Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available