Title: Visual attention mechanism in deep learning and its applications
Author: Yan, Shiyang
ISNI: 0000 0004 7659 1732
Awarding Body: University of Liverpool
Current Institution: University of Liverpool
Date of Award: 2018
Recently, in computer vision, a branch of machine learning called deep learning has attracted considerable attention due to its superior performance in various computer vision tasks such as image classification, object detection, semantic segmentation, action recognition and image description generation. Deep learning aims at discovering multiple levels of distributed representations, which have been validated to be discriminatively powerful in many tasks. Visual attention is the ability of the vision system to selectively focus on the salient and relevant features in a visual scene. Its core objective is to minimise the amount of visual information that must be processed to solve complex high-level tasks, e.g., object recognition, making the whole vision process more efficient. Visual attention is not a new topic; it has been addressed in conventional computer vision algorithms for many years. The development and deployment of visual attention in deep learning algorithms are of vital importance, since the visual attention mechanism matches well with the human visual system and also improves performance in many real-world applications. This thesis is on visual attention in deep learning, starting from the recent progress in visual attention mechanisms, followed by several contributions targeting diverse applications in computer vision: action recognition from still images, action recognition from videos and image description generation. Firstly, the soft attention mechanism, which was initially proposed in combination with Recurrent Neural Networks (RNNs), especially Long Short-Term Memories (LSTMs), was applied in image description generation.
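The soft attention idea referred to above can be illustrated with a minimal numpy sketch: alignment scores are computed between a query (e.g., an LSTM hidden state) and each spatial location of a feature map, a softmax turns the scores into weights, and the context vector is the weighted sum of the features. The bilinear scoring matrix W here is a hypothetical parameterisation chosen for brevity, not the thesis's exact formulation.

```python
import numpy as np

def soft_attention(features, query, W):
    """Soft attention over spatial feature locations.

    features: (L, D) array of L location features (e.g. CNN feature map cells)
    query:    (D,) conditioning vector (e.g. an LSTM hidden state)
    W:        (D, D) learned scoring matrix (illustrative parameterisation)
    Returns the attention weights (L,) and the context vector (D,).
    """
    scores = features @ W @ query              # alignment score per location
    scores -= scores.max()                     # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ features               # convex combination of features
    return weights, context

rng = np.random.default_rng(0)
feats = rng.standard_normal((49, 8))           # e.g. a 7x7 feature map, 8 channels
q = rng.standard_normal(8)
W = rng.standard_normal((8, 8))
w, ctx = soft_attention(feats, q, W)
```

Because every operation is differentiable, the whole module can be trained end-to-end with ordinary backpropagation, which is what distinguishes soft attention from the hard (discrete) variant discussed later.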
In this thesis, instead, as one contribution to the visual attention mechanism, the soft attention mechanism is plugged directly into convolutional neural networks for the task of action recognition from still images. Specifically, a multi-branch attention network is proposed to capture the object that the human is interacting with and the scene in which the action is performed. The soft attention mechanism applied in this task plays a significant role in capturing multi-type contextual information during recognition. Also, the proposed model can be applied in two experimental settings: with and without the bounding box of the person. The experimental results show that the proposed networks achieve state-of-the-art performance on several benchmark datasets. For action recognition from videos, our contribution is twofold. Firstly, the hard attention mechanism, which selects a single part of the features during recognition, is essentially a discrete unit in a neural network. This hard attention mechanism shows superior capacity in discriminating the critical information/features for action recognition from videos, but often suffers from high variance during training, as it employs the REINFORCE algorithm as its gradient estimator. Hence, this raises another critical research question, i.e., the gradient estimation of discrete units in a neural network. In this thesis, a Gumbel-softmax gradient estimator is applied to achieve this goal, with much lower variance and more stable training. Secondly, to learn a hierarchical and multi-scale structure for the multi-layer RNN model, we embed discrete gates to control the information flow between each layer of the RNNs. To make the model differentiable, instead of using a REINFORCE-like algorithm, we propose to use Gumbel-sigmoid to estimate the gradient of these discrete gates.
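The two relaxations named above can be sketched in a few lines. Gumbel-softmax perturbs the logits of a categorical choice with Gumbel noise and applies a temperature-scaled softmax, yielding a differentiable, approximately one-hot sample; Gumbel-sigmoid does the analogous thing for a binary gate. This is a generic sketch of the standard estimators, not the thesis's specific network code.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of a categorical sample.

    logits: (K,) unnormalised log-probabilities of K choices
    tau:    temperature; as tau -> 0 the sample approaches one-hot
    Returns a (K,) vector on the probability simplex.
    """
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y -= y.max()                               # numerical stability
    return np.exp(y) / np.exp(y).sum()

def gumbel_sigmoid(logit, tau=1.0, rng=None):
    """Differentiable relaxation of a binary (Bernoulli) gate."""
    rng = rng or np.random.default_rng()
    g1 = -np.log(-np.log(rng.uniform()))
    g2 = -np.log(-np.log(rng.uniform()))
    x = (logit + g1 - g2) / tau
    return 1.0 / (1.0 + np.exp(-x))
```

At training time the relaxed samples replace the discrete draws, so gradients flow through the noise-perturbed softmax/sigmoid instead of requiring a REINFORCE score-function estimate, which is where the variance reduction comes from.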
For the task of image captioning, there are two main contributions in this thesis. Primarily, the visual attention mechanism can not only be used to reason on the global image features but also plays a vital role in selecting relevant features from the fine-grained objects appearing in the image. To form a more comprehensive image representation, as a contribution to the encoder network for image captioning, a new hierarchical attention network is proposed to fuse the global image and local object features through the construction of a hierarchical attention structure, improving the visual representation for image captioning. Secondly, to address an inherent problem, the exposure bias issue of the RNN-based language decoder commonly used in image captioning, instead of relying only on the supervised training scheme, an adversarial training-based policy gradient optimisation algorithm is proposed to train the networks for image captioning, with improved results on the evaluation metrics. In conclusion, comprehensive research has been carried out on the visual attention mechanism in deep learning and its applications, which include action recognition and image description generation. Related research topics have also been discussed, for example, the gradient estimation of discrete units and the solution to the exposure bias issue in the RNN-based language decoder. For action recognition and image captioning, this thesis presents several contributions which proved effective in improving existing methods.
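The policy gradient training mentioned above can be reduced to a REINFORCE-style objective: the decoder samples a caption, a scalar reward is attached to it (in the adversarial setting, a discriminator score), and the loss weights the sampled words' log-probabilities by the reward so that minimising the loss ascends the expected reward. The helper below is a hypothetical minimal form; the discriminator and decoder themselves are outside this sketch.

```python
import numpy as np

def policy_gradient_loss(log_probs, reward, baseline=0.0):
    """REINFORCE-style loss for one sampled caption.

    log_probs: (T,) log-probability of each sampled word under the decoder
    reward:    scalar reward for the whole caption, e.g. a discriminator
               score in the adversarial setting
    baseline:  subtracted from the reward for variance reduction
    Minimising this loss increases the expected reward.
    """
    advantage = reward - baseline
    return float(-(advantage * np.asarray(log_probs)).sum())

# Example: two-word caption with log-probs -1 and -2, reward 1.0
loss = policy_gradient_loss([-1.0, -2.0], reward=1.0)  # -> 3.0
```

Because the reward is attached to sampled (not teacher-forced) captions, training in this regime exposes the decoder to its own predictions, which is precisely what mitigates exposure bias.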
Supervisor: Zhang, Bailing
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral