Use this URL to cite or link to this record in EThOS:
Title: Analytic knowledge discovery techniques for ad-hoc information retrieval and automatic text summarization
Author: Goyal, Pawan
Awarding Body: University of Ulster
Current Institution: Ulster University
Date of Award: 2011
Availability of Full Text:
Access through EThOS:
Information retrieval is broadly concerned with the problem of automated searching for information within some document repository to support various information requests by users. The traditional retrieval frameworks work on the simplistic assumptions of “word independence” and “bag-of-words”, giving rise to problems such as “term mismatch” and “context independent document indexing”. Automatic text summarization systems, which use the same paradigm as that of information retrieval, also suffer from these problems. The concept of “semantic relevance” has also not been formulated in the existing literature. This thesis presents a detailed investigation of the knowledge discovery models and proposes new approaches to address these issues. The traditional retrieval frameworks do not succeed in defining the document content fully because they do not process the concepts in the documents; only the words are processed. To address this issue, a document retrieval model has been proposed using concept hierarchies, learnt automatically from a corpora. A novel approach to give a meaningful representation to the concept nodes in a learnt hierarchy has been proposed using a fuzzy logic based soft least upper bound method. A novel approach of adapting the vector space model with dependency parse relations for information retrieval also has been developed. A user query for information retrieval (IR) applications may not contain the most appropriate terms (words) as actually intended by the user. This is usually referred to as the term mismatch problem and is a crucial research issue in IR. To address this issue, a theoretical framework for Query Representation (QR) has been developed through a comprehensive theoretical analysis of a parametric query vector. A lexical association function has been derived analytically using the relevance criteria. The proposed QR model expands the user query using this association function. A novel term association metric has been derived using the Bernoulli model of randomness. x The derived metric has been used to develop a Bernoulli Query Expansion (BQE) model. The Bernoulli model of randomness has also been extended to the pseudo relevance feedback problem by proposing a Bernoulli Pseudo Relevance (BPR) model. In the traditional retrieval frameworks, the context in which a term occurs is mostly overlooked in assigning its indexing weight. This results in context independent document indexing. To address this issue, a novel Neighborhood Based Document Smoothing (NBDS) model has been proposed, which uses the lexical association between terms to provide a context sensitive indexing weight to the document terms, i.e. the term weights are redistributed based on the lexical association with the context words. To address the “context independent document indexing” for sentence extraction based text summarization task, a lexical association measure derived using the Bernoulli model of randomness has been used. A new approach using the lexical association between terms has been proposed to give a context sensitive weight to the document terms and these weights have been used for the sentence extraction task. Developed analytically, the proposed QR, BQE, BPR and NBDS models provide a proper mathematical framework for query expansion and document smoothing techniques, which have largely been heuristic in the existing literature. Being developed in the generalized retrieval framework, as also proposed in this thesis, these models are applicable to all of the retrieval frameworks. These models have been empirically evaluated over the benchmark TREC datasets and have been shown to provide significantly better performance than the baseline retrieval frameworks to a large degree, without adding significant computational or storage burden. The Bernoulli model applied to the sentence extraction task has also been shown to enhance the performance of the baseline text summarization systems over the benchmark DUC datasets. The theoretical foundations alongwith the empirical results verify that the proposed knowledge discovery models in this thesis advance the state of the art in the field of information retrieval and automatic text summarization.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available