Title: Integrating text-mining approaches to identify entities and extract events from the biomedical literature
Author: Gerner, Lars Martin Anders
ISNI: 0000 0004 2719 9355
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2012
The biomedical literature is growing at an exponential rate and is becoming increasingly difficult to navigate. Text-mining methods can potentially mitigate this problem through the systematic, large-scale extraction of structured information from inherently unstructured biomedical text. This thesis reports the development of four text-mining systems that, by building on each other, have enabled the extraction of information about a large number of published statements in the biomedical literature. The first system, LINNAEUS, enables highly accurate detection ('recognition') and identification ('normalization') of species names in biomedical articles. Building on LINNAEUS, we implemented a range of improvements in the GNAT system, enabling high-throughput gene/protein detection and identification. Using gene/protein identification from GNAT, we developed the Gene Expression Text Miner (GETM), which extracts information about gene expression statements. Finally, building on GETM as a pilot project, we constructed the BioContext integrated event extraction system, which was used to extract information about over 11 million distinct biomolecular processes in 10.9 million abstracts and 230,000 full-text articles. The ability to detect negated statements in the BioContext system enables the preliminary analysis of potential contradictions in the biomedical literature. All tools (LINNAEUS, GNAT, GETM, and BioContext) are available under open-source software licenses, and LINNAEUS and GNAT are available as web services. All extracted data (36 million BioContext statements, 720,000 GETM statements, 72,000 contradictions, 37 million mentions of species names, 80 million mentions of gene names, and 57 million mentions of anatomical location names) is available for bulk download. In addition, the data extracted by GETM and BioContext is available to biologists through easy-to-use search interfaces.
Supervisor: Bergman, Casey; Nenadic, Goran
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available
Keywords: Biomedical text mining