Title: Topical subcategory structure in text classification
Author: Lyra, Risto Matti Juhani
ISNI: 0000 0004 7657 6452
Awarding Body: University of Sussex
Current Institution: University of Sussex
Date of Award: 2019
Availability of Full Text: Full text unavailable from EThOS.
Data sets with rich topical structure are common in real-world text classification tasks. A single data set often contains a wide variety of topics and, in a typical task, the documents belonging to each class are dispersed across many of those topics. The relationship between the topic a document discusses and its class label is often complex: positive or negative sentiment is expressed in documents on many different topics, but knowing the topic does not necessarily help in determining the sentiment label. We know from tasks such as domain adaptation that sentiment is expressed in different ways under different topics. Topical context can in some cases even reverse the sentiment polarity of a word: being sharp is a good quality in a knife but a bad one in a singer. This property is found in many document classification tasks. Standard document classification algorithms neither account for nor take advantage of topical diversity; instead, classifiers are usually trained under the tacit assumption that topical diversity plays no role. This thesis examines the topical structure of corpora, how the target labels in a classification task are distributed over the topics, and how the topical structure can be used to build ensemble models for text classification. We show empirically that a data set with rich topical structure can be problematic for single classifiers, and we develop two novel ensemble models to address the issues. We focus on two document classification tasks: document-level sentiment analysis of product reviews and hierarchical categorisation of news text. For each task we develop a novel ensemble method that uses topic models to address the shortcomings of traditional text classification algorithms. Our contribution is to show empirically that the class association of document features is topic dependent. We show that using the topical context of documents to build ensembles is beneficial for some tasks, and we present two new ensemble models for document classification. We also provide a fresh viewpoint for reasoning about the relationship between class labels, topical categories and document features.
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Q0337.5 Pattern recognition systems ; QA076.9.D343 Data mining