Title: Feature extraction and encoding for video action recognition
Author: Zuo, Zheming
ISNI: 0000 0004 9354 8541
Awarding Body: Northumbria University
Current Institution: Northumbria University
Date of Award: 2020
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
Video action recognition, including Third-person Action Recognition (TAR) and Egocentric Action Recognition (EAR), is one of the essential tasks in computer vision. It can be regarded as the capability of determining whether a given human action occurs in a video. Although advances in machine learning and deep learning have significantly improved action classification performance, some open questions remain far from comprehensively resolved: (1) uncertainty management in feature extraction; (2) estimated attention may not be concordant with subjective attention during feature extraction; and (3) dimensionality may blow up during feature encoding, which limits the practicality and applicability of visual systems. In its first part, this PhD project presents a Histogram of Fuzzy Local Spatio-Temporal Descriptors (HFLSTD) to support uncertainty management when extracting a conventional gradient-based local feature descriptor, by estimating the contribution of each pixel towards each angular bin, adaptively controlled by a penalty parameter. The efficiency and efficacy of the HFLSTD have been confirmed on the domain benchmarks of two large-scale data sets, yielding competitive performance even in comparison with some recently proposed deep feature descriptors. Then, in order to extract feature descriptors from more informative 3-D attentional regions, the Gaze-Informed Descriptors (GD) are sparsely devised by utilising human eye fixation in conjunction with estimated attention to inform the generation of a 3-D region of interest, and hence help to extract more informative visual feature descriptors in the context of EAR. The Saliency Descriptors (SD), on which the membership is based, are also developed in a dense manner for situations where human eye fixation information is not available.
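The idea of letting each pixel contribute to every angular bin, with a penalty parameter controlling how sharply that contribution falls off, can be pictured with a minimal soft-binning sketch. The record does not give the thesis's actual membership function, so the exponential membership below, the function name `fuzzy_orientation_histogram`, and the parameters `n_bins` and `penalty` are all assumptions made for illustration, not the HFLSTD itself.

```python
import numpy as np

def fuzzy_orientation_histogram(gx, gy, n_bins=8, penalty=4.0):
    """Soft-binned histogram of gradient orientations (illustrative only).

    gx, gy  : arrays of horizontal and vertical gradients, same shape.
    penalty : hypothetical parameter; larger values push each pixel's
              vote towards its nearest bin, 0 spreads it evenly.
    """
    magnitude = np.hypot(gx, gy).ravel()
    angle = np.arctan2(gy, gx).ravel() % (2 * np.pi)  # wrap into [0, 2*pi)
    centres = (np.arange(n_bins) + 0.5) * (2 * np.pi / n_bins)

    # Circular angular distance between every pixel and every bin centre.
    diff = np.abs(angle[:, None] - centres[None, :])
    dist = np.minimum(diff, 2 * np.pi - diff)

    # Assumed membership function: exponential decay with squared distance.
    membership = np.exp(-penalty * dist ** 2)
    membership /= membership.sum(axis=1, keepdims=True)  # each pixel sums to 1

    # Accumulate magnitude-weighted memberships, then L1-normalise.
    hist = (membership * magnitude[:, None]).sum(axis=0)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

With `penalty=0` every pixel votes equally for all bins, and as `penalty` grows the result approaches a conventional hard-binned orientation histogram.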
The effectiveness of GD and SD in enhancing classification performance is demonstrated not only on a collected EAR data set but also through a real-time memory aid system that supports health care for patients with dementia and Parkinson’s disease. In the second part of this work, the Saliency-Informed Spatio-Temporal Vector of Locally Aggregated Descriptors and Fisher Vector (SST-VLAD and SST-FV) are developed to address the inherent redundancy of both video action data sets and extracted feature descriptors, mitigating the curse of dimensionality in super-vector-based encoding schemes. This is achieved through a tentative proposition: select the minimum number of videos from the data set, and thereby only a small portion of the feature descriptors, using ranked video-wise saliency-based spatio-temporal scores, which in turn guide codebook generation. Extensive experimental results show that, compared with VLAD and FV, SST-VLAD and SST-FV have much lower space and time complexity and relatively higher action classification performance on one TAR and one EAR data set.
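The selection-then-encoding pipeline described above might be sketched roughly as follows: rank videos by a per-video score, build the codebook only from the top-ranked videos, then encode with standard VLAD. This is plain VLAD with a reduced codebook training set, not the thesis's SST-VLAD. The scores are made-up placeholders rather than real saliency values, and the helper names, the fixed `keep` count, and the toy k-means are assumptions for illustration.

```python
import numpy as np

def select_top_videos(scores, keep):
    """Indices of the `keep` videos with the highest (hypothetical) scores."""
    return np.argsort(scores)[::-1][:keep]

def kmeans(x, k, n_iter=20, seed=0):
    """Tiny Lloyd's k-means, used here only to build a toy codebook."""
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:          # keep old centre if cluster is empty
                centres[j] = members.mean(axis=0)
    return centres

def vlad_encode(descriptors, centres):
    """Standard VLAD: sum residuals per codeword, power- and L2-normalise."""
    k, dim = centres.shape
    d = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    v = np.zeros((k, dim))
    for j in range(k):
        members = descriptors[labels == j]
        if len(members) > 0:
            v[j] = (members - centres[j]).sum(axis=0)
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))   # power normalisation
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy data: 6 videos, each with 50 random 16-D descriptors (placeholders).
rng = np.random.default_rng(0)
videos = [rng.normal(size=(50, 16)) for _ in range(6)]
scores = np.array([0.2, 0.9, 0.4, 0.8, 0.1, 0.6])   # made-up per-video scores

chosen = select_top_videos(scores, keep=3)
codebook = kmeans(np.vstack([videos[i] for i in chosen]), k=4)
code = vlad_encode(videos[0], codebook)   # length k * dim = 64
```

Training the codebook on three of the six videos halves the number of descriptors passed to clustering, which is the kind of space and time saving the abstract attributes to the selection step.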
Supervisor: Yang, Longzhi ; Wei, Bo
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: G400 Computer Science