Using document clustering and language modelling in mediated information retrieval
Our work addresses a well-documented problem: users are frequently unable to articulate a query that clearly and comprehensively expresses their information need. This can be attributed to the information need being ambiguous and not clearly defined in the user's mind, to the user's lack of knowledge of the domain of interest, to a lack of understanding of a retrieval system's conceptual model, or to an inability to use a certain query syntax. This thesis proposes a software tool that emulates the human search mediator. It helps a user explore a domain of interest, learn its structure, terminology and key concepts, and clarify and refine an information need. It can also help a user generate high-quality queries for searching the World Wide Web or other large, heterogeneous document collections. Our work was inspired by library studies that have highlighted the role of the librarian in helping the user explore her information need, define the problem to be solved, articulate a formulation of the information need, and adapt it for the retrieval system at hand in order to obtain information. Our approach, mediated access through a clustered collection, is based on an information access environment in which the user can explore a relatively small, well-structured, pre-clustered document collection covering a particular subject domain, in order to understand the concepts encompassed and to clarify and refine her information need. At the same time, the user can ostensively indicate clusters and documents of interest so that the system builds a model of the user's topic of interest. Based on this model, the system assists and guides the user's exploration, or generates 'mediated queries' that can be used to search other collections. We present the design and evaluation of WebCluster, a system that reifies the concept of mediated retrieval.
Additionally, we present a variety of mediation experiments, which provide guidelines as to which mediation strategies are more appropriate for different types of tasks. A further set of experiments evaluates the capacity of document clustering to group together documents on the same topic and to support mediation. In this context we propose and experimentally test a new formulation of the cluster hypothesis. We also look at the ability of language models to convey content, to represent topics and to highlight specific concepts in a given context. Language models are also successfully applied to generate flexible, task-dependent cluster representatives that support exploration through browsing and through searching, respectively. Our experimental results show that mediation has the potential to significantly improve user queries and, consequently, retrieval effectiveness.
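As a rough illustration of how a language model might yield a cluster representative, the sketch below ranks the terms of a cluster by their contribution to the Kullback-Leibler divergence between a unigram model of the cluster and a unigram model of the whole collection. This is an assumption made for illustration only: the abstract does not state which estimation or ranking method the thesis uses, and the function names (unigram_model, cluster_representative) and the toy data are hypothetical.

```python
import math
from collections import Counter


def unigram_model(docs):
    """Maximum-likelihood unigram model over a list of tokenised documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def cluster_representative(cluster_docs, collection_docs, k=3):
    """Return the k terms that best distinguish a cluster from the collection.

    Each term is scored by p_cluster(t) * log(p_cluster(t) / p_collection(t)),
    its pointwise contribution to KL(cluster || collection).
    Assumes the cluster is a subset of the collection, so every cluster term
    has non-zero collection probability and no smoothing is needed.
    """
    p_cluster = unigram_model(cluster_docs)
    p_collection = unigram_model(collection_docs)
    scores = {
        term: p * math.log(p / p_collection[term])
        for term, p in p_cluster.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical toy collection: two documents about cars, two about animals.
collection = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["jaguar", "cat", "prey"],
]
animal_cluster = collection[2:]
print(cluster_representative(animal_cluster, collection))
```

In this toy example the term "jaguar" is as frequent in the cluster as in the collection, so it scores zero and is left out of the representative, while terms specific to the cluster are ranked first. A representative intended for searching rather than browsing could weight terms differently; the choice here is only one plausible option.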