Title:
|
Deep learning-based person re-identification
|
Abstract:
|
In this dissertation I address the problem of person re-identification for wide-area surveillance applications, in the challenging scenario of an unconstrained network of uncalibrated, non-overlapping cameras. Of all the sources of uncertainty affecting pedestrian appearance, I focus on the viewpoint changes that occur across cameras. This variability causes the representations of different identities to interfere with one another in the feature space, degrading the discrimination capability of the re-identification system. To deal with this problem, I propose two effective methods that rely on the representational power of deep architectures: a newly designed embedding learning technique and a pose-aware regulation approach for video sequences enabled by a generative model.

My first method addresses the viewpoint problem for still images, aiming to make the most of a fixed amount of training data without exploiting any side information. I present a novel training loss for convolutional neural networks that achieves better optimization by learning a convenient embedding space. The loss targets two aspects at once: expanding the feature space and reducing the intra-class variability of all identities, without increasing training complexity or requiring sample mining techniques. In a demo, I illustrate the benefits of combining this method with an ad-hoc novelty threshold for open-set re-identification.

My second method moves towards the requirements of real-world applications by extending the investigation of the viewpoint factor to the video context, where samples are tracklets. I apply, for the first time, a GAN-based generative model to video sequences to complement and pose-align the original, incomplete data. I proceed in two steps. First, I normalize tracklets with respect to a set of canonical poses, filling in the missing pose/viewpoint information with synthetic GAN-generated images; a weighted fusion scheme then combines the generated information with the original data representation. Second, I perform explicit pose-based alignment of sequence pairs to promote coherent feature matching, mitigating the negative effect of small inter-identity distances.

Both approaches compare favorably with the state of the art, showing significant improvements over competing techniques on several popular public datasets.
|
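
The abstract leaves the embedding loss unspecified. As a minimal illustrative sketch of the general idea only, the hypothetical PyTorch loss below pulls each sample toward a learnable class center (reducing intra-class variability) while pushing the centers apart (expanding the occupied feature space), with no pair or triplet mining; the class name, margin, and exact formulation are assumptions, not the dissertation's actual loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpreadCompactLoss(nn.Module):
    """Hypothetical sketch, NOT the dissertation's loss: compact each
    identity around a learnable center and spread the centers apart,
    so no hard-sample mining is needed."""

    def __init__(self, num_classes: int, dim: int, margin: float = 1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))
        self.margin = margin

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        emb = F.normalize(emb, dim=1)
        centers = F.normalize(self.centers, dim=1)
        # Intra-class term: pull samples toward their own identity center.
        compact = (emb - centers[labels]).pow(2).sum(dim=1).mean()
        # Inter-class term: hinge penalty on centers closer than `margin`
        # (the diagonal is offset so a center is never pushed from itself).
        d = torch.cdist(centers, centers)
        d = d + self.margin * torch.eye(len(centers), device=d.device)
        spread = F.relu(self.margin - d).mean()
        return compact + spread
```

Because the centers are module parameters, they are updated by the same optimizer as the backbone, which is one way to keep training complexity flat, as the abstract requires.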
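
The ad-hoc novelty threshold for the open-set demo is likewise not defined in the abstract. A common realization, shown here purely as an assumption, rejects a query whose distance to its nearest gallery identity exceeds a threshold `tau`:

```python
import torch

def open_set_match(query, gallery, gallery_ids, tau: float):
    """Assumed open-set decision rule, not taken from the thesis.
    query: (D,) and gallery: (N, D) L2-normalized embeddings."""
    sims = gallery @ query                                # cosine similarities
    best = int(sims.argmax())
    dist = torch.sqrt((2 - 2 * sims[best]).clamp(min=0))  # cosine -> L2
    if dist > tau:
        return None                                       # novel identity: reject
    return int(gallery_ids[best])
```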
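
For the video method's first step, the weighted fusion of original and GAN-generated pose information could look like the following sketch; the mean pooling and the weight `alpha` are illustrative assumptions rather than the thesis's actual scheme.

```python
import torch

def fuse_tracklet(real_feats: torch.Tensor,
                  synth_feats: torch.Tensor,
                  alpha: float = 0.7) -> torch.Tensor:
    """real_feats: (T, D) features of the tracklet's real frames;
    synth_feats: (K, D) features of GAN-generated canonical-pose images
    that fill in the viewpoints missing from the tracklet.
    Returns one (D,) descriptor; alpha weights the real data."""
    real = real_feats.mean(dim=0)
    synth = synth_feats.mean(dim=0)
    return alpha * real + (1.0 - alpha) * synth
```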
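
Similarly, explicit pose-based alignment of a sequence pair can be sketched as comparing two tracklets pose by pose over a shared set of canonical poses; the eight-pose vocabulary and the averaging scheme below are assumptions for illustration only.

```python
import torch

def pose_aligned_distance(feats_a, poses_a, feats_b, poses_b, num_poses=8):
    """Illustrative sketch: for each canonical pose present in both
    tracklets, average that pose's frame features and accumulate the
    L2 distance, so only pose-coherent features are matched."""
    dists = []
    for p in range(num_poses):
        ma, mb = poses_a == p, poses_b == p
        if ma.any() and mb.any():
            fa = feats_a[ma].mean(dim=0)
            fb = feats_b[mb].mean(dim=0)
            dists.append(torch.norm(fa - fb))
    if not dists:
        return torch.tensor(float("inf"))  # no common pose: no match
    return torch.stack(dists).mean()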