Title: Describing human activities in video streams
Author: Alharbi, Nouf
ISNI: 0000 0004 6495 4022
Awarding Body: University of Sheffield
Current Institution: University of Sheffield
Date of Award: 2017
This thesis outlines and advances a video description framework that describes human activities and their spatial and temporal relationships, and that can be used for video indexing, retrieval and summarisation applications. Generating natural language descriptions of video streams demands the extraction of high-level features (HLFs) that sufficiently represent the events. This research centres on the issues associated with this task. One of these issues is the difficulty of identifying and segmenting the participating humans, and identifying their visual attributes, in the face of a broad range of scene setting variations, occlusion and background clutter. To that end, a five-stage approach is developed to investigate the video description task. Firstly, a corpus suitable for development and evaluation is created. It contains relatively long video clips of human activities constructed from the Hollywood2 dataset, depicting a variety of action classes, along with human textual annotations for each clip. Extensive analysis of the hand annotations associated with this corpus leads to the conclusion that annotators are most interested in the humans present in the video stream and their visual attributes, especially their actions and their interactions with other objects. Secondly, based on the outcome of this analysis, a novel framework that can detect, segment and track human body regions over video frames is proposed in order to describe video semantics efficiently. The proposed framework leverages both low-level image cues and high-level part detector information. Thirdly, the visual attributes of the extracted human objects are obtained by introducing an efficient human action recognition framework. The video representation is improved by using the extracted spatio-temporal human regions combined with the extended spatio-temporal locality-constrained linear coding (LLC) technique in order to identify the action class.
Human action classification benchmarks are used to assess the performance of this model. The results show that this approach outperforms the state of the art, owing to its efficient representation of complex actions in video streams. Fourthly, as the spatial and temporal relations of prominent objects play a vital role in describing video semantic content, a comprehensive representation is developed to efficiently extract the spatial and temporal relations between interacting objects present in a video clip using their approximate oriented bounding boxes. The final stage converts the extracted HLFs into sentential descriptions using a template-based approach. Calculating the overlap between the machine-generated descriptions and those annotated by humans confirms that the automatic descriptions capture context information, which means that these descriptions are compatible with human viewing abilities. Finally, a video retrieval task based on textual queries is designed to evaluate the generated natural language descriptions. The experimental outcome shows that the approach is able to retrieve relevant video segments and capture the main aspects of video semantics.
Supervisor: Gotoh, Yoshihiko
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available