Title: Learning visual recognition of fine-grained object categories from textual descriptions
Author: Wang, Josiah Kwok-Siang
Awarding Body: University of Leeds
Current Institution: University of Leeds
Date of Award: 2013
This thesis investigates the task of learning visual object category recognition from textual descriptions. The work contributes primarily to the recognition of fine-grained object categories, such as animal and plant species, where it may be difficult to collect many training images but where textual descriptions are readily available, for example from online nature guides. The idea of using textual descriptions for fine-grained object category recognition is explored in three separate but related tasks. The first is the task of learning recognition of object categories solely from textual descriptions; no category-specific training images are used. Our proposed framework comprises three components: (i) natural language processing to build object category models from textual descriptions; (ii) visual processing to extract visual attributes from test images; (iii) a generative model connecting textual terms and the visual attributes extracted from images. As an 'upper bound', we also evaluate how well humans perform on a similar task. The proposed method was evaluated on a butterfly dataset as an example, performing substantially better than chance and, interestingly, comparably to non-native English speakers. The second task is an extension of the first. Here we focus on the problem of learning models for attribute terms (e.g. "orange bands") from a set of training classes disjoint from the test classes. Attribute models are learnt independently for each attribute term in a weakly supervised fashion from textual descriptions, and are used in conjunction with textual descriptions of the test classes to build probabilistic models for object category recognition. Our method achieved modest accuracy when evaluated on a butterfly dataset, although performance improved substantially with some human supervision to combine similar attribute terms.
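The generative connection between attribute terms mined from descriptions and attributes detected in a test image can be illustrated with a minimal sketch. This is not the thesis code: the attribute terms, category names, and detector reliability parameters below are invented for illustration, and the scoring is a simple naive-Bayes-style model over binary attribute detections.

```python
import math

# Hypothetical sketch: score each category by how well the attributes
# detected in a test image match the attribute terms mentioned in that
# category's textual description, assuming fixed detector reliabilities.

def score_categories(detected_attributes, category_descriptions,
                     p_true_positive=0.8, p_false_positive=0.1):
    """Return a log-probability score per category.

    detected_attributes: set of attribute terms detected in the test image.
    category_descriptions: dict mapping category name -> set of attribute
        terms mentioned in that category's textual description.
    p_true_positive / p_false_positive: assumed detector reliabilities.
    """
    vocabulary = set().union(*category_descriptions.values())
    scores = {}
    for category, mentioned in category_descriptions.items():
        log_score = 0.0
        for attr in vocabulary:
            detected = attr in detected_attributes
            expected = attr in mentioned
            # Probability of this detection outcome given the description.
            if expected:
                p = p_true_positive if detected else 1 - p_true_positive
            else:
                p = p_false_positive if detected else 1 - p_false_positive
            log_score += math.log(p)
        scores[category] = log_score
    return scores

# Toy descriptions (invented attribute terms, not from the thesis):
descriptions = {
    "monarch": {"orange wings", "black veins", "white spots"},
    "cabbage white": {"white wings", "black spots"},
}
scores = score_categories({"orange wings", "white spots"}, descriptions)
best = max(scores, key=scores.get)  # highest-scoring category
```

In this toy run the detected attributes overlap most with the "monarch" description, so that category receives the highest score; a real system would replace the fixed reliability parameters with attribute detectors learnt from images.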
The third task explores how textual descriptions can be used to automatically harvest training images for each object category. Starting with just the category name, a textual description and no example images, web pages are gathered from search engines, and images are filtered based on how similar their surrounding texts are to the given textual description. The idea is that images in close proximity to texts that are similar to the textual description are more likely to be example images of the desired category. The proposed method is demonstrated on a set of butterfly categories, where images were successfully re-ranked based on their corresponding text blocks alone, with many categories achieving higher precision than their baselines at early stages of recall. The proposed approaches of exploiting textual descriptions, although still in their infancy, show potential for visual object recognition tasks, effectively reducing the amount of human supervision required for annotating images.
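The re-ranking step described above can be sketched in a few lines. This is a hypothetical illustration rather than the thesis implementation: it compares each image's surrounding text to the category description using plain bag-of-words cosine similarity, and all filenames and text snippets are invented.

```python
import math
from collections import Counter

# Hypothetical sketch: re-rank harvested web images by the similarity of
# each image's surrounding text to the category's textual description.

def bag_of_words(text):
    """Lowercased word-count vector for a text block."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two Counter word vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rerank_images(description, images_with_text):
    """Sort (image_url, surrounding_text) pairs, most similar first."""
    desc_vec = bag_of_words(description)
    return sorted(
        images_with_text,
        key=lambda item: cosine_similarity(desc_vec, bag_of_words(item[1])),
        reverse=True,
    )

# Toy example (invented data): the image whose surrounding text resembles
# the description should rise to the top of the ranking.
description = "orange wings with black veins and white spots"
candidates = [
    ("img1.jpg", "buy cheap posters online today"),
    ("img2.jpg", "the monarch has orange wings with black veins"),
]
ranked = rerank_images(description, candidates)
```

A realistic pipeline would use weighted term vectors (e.g. tf-idf) over the extracted text blocks rather than raw counts, but the ranking principle is the same: images near description-like text are promoted.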
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available