Title: Improving information retrieval for the biomedical sciences : a case study on mouse knockout literature
Author: Farhan, Reyhood
ISNI: 0000 0001 3457 3163
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2008
Given the growth in accessible online research literature, it is vital that biologists can find articles containing information relevant to their investigations. This thesis details research aimed at improving the sensitivity of information retrieval to topics contained in articles whose overall topic is different. Fetal growth and development (FGD) information contained in articles reporting mouse gene knockout studies was used as the exemplar domain throughout. The research began by assessing the characteristics of the mouse knockout and FGD domains. Two leading domain experts were consulted to develop a definition of FGD information that could be used as the basis for experimentation. Four other domain experts then annotated FGD information in a sample set of 20 articles according to the provided definition, both to verify its presence in the mouse knockout literature and to elucidate topics and words associated with the FGD information. Subsequently, numerical and lexical methods were assessed in separate tests for detecting FGD topics in the mouse knockout articles. The numerical methods could not be made to differentiate between the various topics, while the lexical methods were impractical to implement for describing the niche FGD domain. Different passage retrieval methods based on the Extended Boolean model were then investigated for their sensitivity to FGD information within the articles. Passages were defined as whole articles, as the titled sections appearing within articles, or as 100-word non-overlapping windows. For each definition, articles were ranked according to their highest-scoring passage in answer to a given query. An experimental methodology was implemented to test each passage retrieval method's relative sensitivity to FGD information based on the judgement of domain experts, quantified using the Discounted Cumulative Gain measure. The retrieval method that uses titled sections as passages performed best in four of the five assessed queries. An experimental web-based information retrieval system was developed, based around a relational database which stored a custom corpus of 20,715 mouse knockout HTML articles, as defined by the MEDLINE database. An automated heuristic procedure to identify section-based passages in HTML articles was developed, which was effective on scientific articles from a large variety of publishers. An examination of the corpus was also undertaken to elicit the domain's characteristics for use in future IR methodologies.
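The passage-based ranking and its evaluation, as described in the abstract, might be sketched roughly as follows. This is an illustrative sketch only, not the thesis implementation: the abstract does not state which Extended Boolean variant, term weighting, or Discounted Cumulative Gain formulation was used, so the p-norm OR score, the term-frequency weights, the log2 discount, and all function names (windows, pnorm_or, rank_articles, dcg) are assumptions.

```python
import math
import re


def windows(text, size=100):
    """Split text into non-overlapping windows of `size` words (last may be shorter)."""
    words = re.findall(r"\w+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def pnorm_or(passage, query_terms, p=2.0):
    """Assumed p-norm (Extended Boolean) OR score of one passage, in [0, 1]."""
    if not passage or not query_terms:
        return 0.0
    # Assumed weighting: term frequency normalised by the most frequent word.
    longest = max(passage.count(w) for w in set(passage))
    weights = [passage.count(t) / longest for t in query_terms]
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)


def rank_articles(articles, query_terms, size=100):
    """Rank article ids by the score of their highest-scoring passage."""
    scores = {
        doc_id: max(
            (pnorm_or(w, query_terms) for w in windows(text, size)),
            default=0.0,
        )
        for doc_id, text in articles.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def dcg(gains):
    """One common DCG formulation: sum of gain / log2(rank + 1), ranks from 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
```

Here `articles` is a hypothetical mapping from an article identifier to its plain text, and `gains` is the list of expert relevance grades in ranked order. Swapping `windows` for a function that returns titled sections, or the whole article as a single passage, would give the other two passage definitions.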
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available