Title:
|
Analytic knowledge discovery techniques for ad-hoc information retrieval and automatic text summarization
|
Information retrieval is broadly concerned with the problem of automated searching
for information within some document repository to support various information
requests by users. Traditional retrieval frameworks rest on the simplistic
assumptions of “word independence” and “bag-of-words”, giving rise to problems
such as “term mismatch” and “context independent document indexing”. Automatic
text summarization systems, which use the same paradigm as information
retrieval, also suffer from these problems. The concept of “semantic relevance” has
also not been formulated in the existing literature. This thesis presents a detailed
investigation of the knowledge discovery models and proposes new approaches to
address these issues.
Traditional retrieval frameworks do not fully capture document content because they
process only the words in a document, not the concepts. To address this issue, a
document retrieval model has been proposed using concept hierarchies learnt
automatically from a corpus. A novel approach has been proposed to give a
meaningful representation to the concept nodes in a learnt hierarchy, using a
fuzzy logic based soft least upper bound method. A novel approach that adapts
the vector space model with dependency parse relations for information retrieval has
also been developed.
A user query in information retrieval (IR) applications may not contain the terms
(words) that best express what the user actually intends. This is usually referred
to as the term mismatch problem and is a crucial research issue in IR. To address
this issue, a theoretical framework for Query Representation (QR) has been developed
through a comprehensive theoretical analysis of a parametric query vector. A lexical
association function has been derived analytically using the relevance criteria. The
proposed QR model expands the user query using this association function. A novel
term association metric has been derived using the Bernoulli model of randomness.
The derived metric has been used to develop a Bernoulli Query Expansion (BQE)
model. The Bernoulli model of randomness has also been extended to the pseudo
relevance feedback problem by proposing a Bernoulli Pseudo Relevance (BPR) model.
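
The general idea of expanding a query through a term association function can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the analytically derived association function or the Bernoulli-based metric, so a simple Jaccard co-occurrence score stands in for them, and the names `cooccurrence_counts`, `association`, `expand_query` and the parameter `k` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count document frequency per term and per unordered term pair."""
    df, pair_df = Counter(), Counter()
    for doc in docs:
        terms = sorted(set(doc))
        df.update(terms)
        pair_df.update(combinations(terms, 2))  # pairs come out in sorted order
    return df, pair_df


def association(a, b, df, pair_df):
    """Placeholder association (NOT the thesis metric): Jaccard overlap of
    the sets of documents containing a and b."""
    if a == b:
        return 0.0
    both = pair_df[tuple(sorted((a, b)))]
    union = df[a] + df[b] - both
    return both / union if union else 0.0


def expand_query(query, docs, k=2):
    """Return the query plus the k terms most associated with its terms."""
    df, pair_df = cooccurrence_counts(docs)
    scores = {
        t: sum(association(q, t, df, pair_df) for q in query)
        for t in df
        if t not in query
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return list(query) + ranked[:k]


docs = [
    ["car", "engine", "repair"],
    ["car", "automobile", "engine"],
    ["automobile", "insurance"],
    ["fruit", "apple"],
]
expanded = expand_query(["car"], docs, k=2)
print(expanded)
```

Any other association score with the same signature could be substituted for `association` without changing `expand_query`.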
In the traditional retrieval frameworks, the context in which a term occurs is mostly
overlooked in assigning its indexing weight. This results in context independent
document indexing. To address this issue, a novel Neighborhood Based Document
Smoothing (NBDS) model has been proposed, which uses the lexical association
between terms to provide a context sensitive indexing weight to the document terms,
i.e. the term weights are redistributed based on the lexical association with the context
words.
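
A minimal sketch of redistributing term weights according to their association with context words is given below. It is not the NBDS formulation itself, which the abstract does not state: the linear interpolation, the parameter `lam`, and the function name `smooth_weights` are assumptions made for illustration, and the association values are supplied by hand.

```python
def smooth_weights(weights, assoc, lam=0.5):
    """Blend each term's raw weight with mass drawn from its context terms.

    weights: {term: raw indexing weight} for a single document
    assoc:   {(term_a, term_b): association in [0, 1]}, treated as symmetric
    lam:     hypothetical interpolation parameter; 0 keeps the raw weights
    """
    def a(t, u):
        return assoc.get((t, u), assoc.get((u, t), 0.0))

    # How strongly the rest of the document "supports" each term.
    support = {
        t: sum(a(t, u) * w for u, w in weights.items() if u != t)
        for t in weights
    }
    support_total = sum(support.values())
    if support_total == 0:
        return dict(weights)  # no associations known: nothing to redistribute

    total = sum(weights.values())
    # A share lam of the total mass is reallocated in proportion to support,
    # so the document's total weight is preserved.
    return {
        t: (1 - lam) * w + lam * total * support[t] / support_total
        for t, w in weights.items()
    }


weights = {"bank": 1.0, "loan": 1.0, "river": 1.0}
assoc = {("bank", "loan"): 0.8, ("bank", "river"): 0.2}
smoothed = smooth_weights(weights, assoc, lam=0.5)
print(smoothed)
```

In this toy document all three terms start with equal weight, yet "loan" ends up weighted above "river" because it is more strongly associated with its context.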
To address “context independent document indexing” in the sentence extraction based
text summarization task, a lexical association measure derived using the Bernoulli
model of randomness has been used. A new approach using the lexical association
between terms has been proposed to give a context sensitive weight to the document
terms, and these weights have been used for the sentence extraction task.
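
How per-term weights can drive sentence extraction is sketched below. The scoring rule (mean term weight per sentence) and the names `extract_sentences` and `n` are assumptions for illustration; the abstract does not specify how the thesis combines the weights, and the weights here are hand-picked rather than derived from the Bernoulli model.

```python
def extract_sentences(sentences, term_weights, n=1):
    """Pick the n highest-scoring sentences and return them in document order.

    sentences:    list of tokenized sentences (lists of terms)
    term_weights: {term: weight}; terms without a weight count as 0
    """
    def score(tokens):
        if not tokens:
            return 0.0
        # Length-normalized so long sentences are not favored automatically.
        return sum(term_weights.get(t, 0.0) for t in tokens) / len(tokens)

    ranked = sorted(
        range(len(sentences)), key=lambda i: (-score(sentences[i]), i)
    )
    return [sentences[i] for i in sorted(ranked[:n])]


sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["the", "weather", "was", "mild"],
    ["loan", "rates", "at", "the", "bank", "fell"],
]
term_weights = {"bank": 1.25, "loan": 1.1, "rates": 0.6}
summary = extract_sentences(sentences, term_weights, n=2)
print(summary)
```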
Developed analytically, the proposed QR, BQE, BPR and NBDS models provide
a proper mathematical framework for query expansion and document smoothing
techniques, which have largely been heuristic in the existing literature. Because they
are developed within the generalized retrieval framework also proposed in this thesis,
these models are applicable to all of the retrieval frameworks. These models have
been empirically evaluated over the benchmark TREC datasets and have been shown
to perform significantly better than the baseline retrieval frameworks,
without adding significant computational or storage burden. The
Bernoulli model applied to the sentence extraction task has also been shown to enhance
the performance of the baseline text summarization systems over the benchmark DUC
datasets. The theoretical foundations, along with the empirical results, verify that the
knowledge discovery models proposed in this thesis advance the state of the art in the
fields of information retrieval and automatic text summarization.
|