Title: Neural models for stepwise text illustration
Author: Batra, Vishwash
ISNI: 0000 0004 9358 3144
Awarding Body: University of Warwick
Current Institution: University of Warwick
Date of Award: 2020
In this thesis, we investigate the task of sequence-to-sequence (seq2seq) retrieval: given a sequence of text passages as the query, retrieve a sequence of images that best describes and aligns with the query. This goes a step beyond traditional cross-modal retrieval, which treats each image-text pair independently and ignores the broader context. Since this is a difficult task, we break it into steps. We start with caption generation for images in news articles. Unlike the traditional image captioning task, where a text description is generated given an image, here a caption is generated conditional on both the image and the news article in which it appears. We propose a novel neural-network-based methodology that takes into account both news article content and image semantics to generate a caption that best describes the image and its surrounding text context. Our results outperform existing approaches to image caption generation. We then introduce two new datasets, GutenStories and Stepwise Recipe, for the task of story picturing and sequential text illustration. GutenStories consists of around 90k text paragraphs, each accompanied by an image, aligned in around 18k visual stories. It covers a wide variety of images and story content styles. Stepwise Recipe is a similar dataset of sequenced image-text pairs, but its images are domain-constrained, namely food-related. It consists of 67k text paragraphs (cooking instructions), each accompanied by an image describing the step, aligned in 10k recipes. Both datasets are web-crawled and systematically filtered and cleaned. We propose a novel variational recurrent seq2seq (VRSS) retrieval model. The model encodes two streams of information at every step: the contextual information from both the text and the images retrieved in previous steps, and the semantic meaning of the current input (text) as a latent vector. These together guide the retrieval of a relevant image from the repository to match the semantics of the given text. The model has been evaluated on both the Stepwise Recipe and GutenStories datasets. The results on several automatic evaluation measures show that our model outperforms several competitive and relevant baselines. We also qualitatively analyse the model, both through human evaluation and by visualizing the representation space to judge its semantic meaningfulness. We further discuss the challenges faced on the more difficult GutenStories dataset and outline possible solutions.
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: QA76 Electronic computers. Computer science. Computer software