Use this URL to cite or link to this record in EThOS: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.772272
Title: Deep learning-based regional image caption generation with refined descriptions
Author: Kinghorn, Phil
ISNI: 0000 0004 7959 7603
Awarding Body: Northumbria University
Current Institution: Northumbria University
Date of Award: 2017
Availability of Full Text: Full text unavailable from EThOS.
Abstract:
Recent research on image captioning generally focuses on short, relatively high-level captions. These captions generally lack detail and insight, omitting information that humans could easily, and generally would, report. This restricts the usefulness of existing systems in real-world applications. In this thesis, we propose the following solutions to address these problems. The first stage proposes a region-based approach that focuses on regions within images and describes them with attributes. These attributes add more meaning to standard classification labels, improving the label 'dog' produced by existing systems to the more detailed label 'white spotted dog'. This adds a large degree of detail when used within template-based description generation. An application in healthcare is also explored, in which the system is paired with a visual agent. The agent can describe the environment and report potential hazards, as well as socialise through conversation. The second stage improves upon the previous architecture by proposing another region-based architecture that removes the rigidity of templates. Instead, sentences are generated by a Recurrent Neural Network. Training this architecture on multiple smaller datasets allows for a quicker training stage, with less computing power required during both training and testing. An encoder-decoder structure is proposed to translate the detailed region labels into full image descriptions. This produces natural-sounding descriptive phrases that accurately depict the contents of an image. The third stage proposes a hierarchically trained, end-to-end style system that generates an image description with the same ability to describe detections in detail, but without the need for multiple models. This system can utilise a humanoid robot's vision and voice synthesis capabilities.
Overall, the systems proposed in this research outperform many state-of-the-art methods on the refined image description generation task, especially on complex and out-of-domain images, such as images of paintings.
Supervisor: Zhang, Li
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.772272
DOI: Not available
Keywords: G400 Computer Science