Title:
|
Towards multi-modal face recognition in the wild
|
Face recognition uses facial appearance to identify or verify human individuals, and has been one of the fundamental research areas in computer vision. Over the past few decades, face recognition has drawn significant attention due to its potential use in biometric authentication, surveillance, security, robotics and so on. Many existing face recognition methods are evaluated on faces collected in labs, and do not generalize well to real-world conditions. Compared with faces captured in labs, faces in the wild inherently follow a multi-modal distribution. This multi-modality leads to significant intra-class variations, and usually requires a large number of labeled samples to cover the wide range of modalities. These difficulties make unconstrained face recognition even more challenging, and create a considerable gap between laboratory research and industrial practice. To bridge the gap, this thesis focuses on multi-modal face recognition in unconstrained environments. It introduces several approaches to address these challenges, which fall into two research directions. The first direction explores a series of deep learning based methods for handling the large intra-class variations in multi-modal face recognition. The combination of modalities in the wild is unpredictable, and thus is difficult to define explicitly in advance. It is therefore desirable to design a framework that adapts to the modality-driven variations of a specific scenario. To this end, the Deep Neural Network (DNN) is adopted as the basis, as a DNN learns the feature representation and the classifier directly with respect to the target objective. To begin with, we aim to learn a part-based facial representation with deep neural networks to address face verification in the wild.
In particular, the proposed framework consists of two purpose-built components: a Deep Mixture Model (DMM) to find accurate patch correspondence and a Convolutional Fusion Network (CFN) to learn the fusion of multiple patch-specific facial features. This framework is specifically designed to handle local distortions caused by modalities such as pose and illumination. The next work introduces conditional partitioning of the sample space into deep learning to tackle face recognition with regard to modalities in a general sense. Without any prior knowledge of modality, the proposed network learns the hidden modalities of faces, and partitions the initial sample space accordingly so that a modality-specific feature representation can be learnt for each partition. The second direction is Semi-Supervised Learning with videos to tackle the shortage of labeled training samples. In particular, a novel Semi-Supervised Learning strategy is proposed for the problem of celebrity identification by harvesting the 'confident' unlabeled samples from vast video sources. The video context information is used to iteratively enrich the diversity of the initial labeled set so that the performance of the learnt classifier gradually improves. All of these works are evaluated with extensive experiments in the corresponding sections of this thesis. The connections and differences among the three approaches are further discussed in the conclusion section.
|