Title: Towards a natural language processing pipeline and search engine for biomedical associations derived from scientific literature
Author: Galea, Dieter
ISNI:       0000 0004 9356 9799
Awarding Body: Imperial College London
Current Institution: Imperial College London
Date of Award: 2019
Biomedical research is published at a rapid rate, with PubMed containing over 29 million publications. A natural language processing (NLP) pipeline facilitating information extraction is required. Existing pipelines achieve promising performance, but are often restricted to a small number of bioentities (such as genes and diseases), ignore negative associations, and treat new claims and background sentences equally. Here, the different NLP tasks required to develop a scalable and generalizable open-source pipeline for biomedical association extraction that tackles these limitations are investigated. In turn, this pipeline is used to build a repository of queryable associations. Starting by optimizing how biomedical language is represented in machine learning (ML) models, state-of-the-art representations are obtained and subsequently used in downstream tasks, including bioentity recognition. The latter work indicates that current recognition models generalize poorly, resulting in unrealistic performance when applied at scale. Additionally, it is shown here that acquiring more data does not improve ML-based entity recognition performance. Beyond ML methods, this work presents a number of dictionary-based approaches, and graph-based dictionaries are compiled from more than 13 sources covering metabolites, genes/proteins, species, chemicals, toxins, drugs, diseases, foods, food compounds and anatomy. These are used to annotate PubMed for subsequent association extraction. To achieve a diverse association extraction pipeline for 10 entity types, we attempt to find a balance between generalizable rules and ML models. A neural model is trained to identify novel association claims with 94% accuracy, and a rule-based approach identifies negated statements with up to 91% accuracy. A set of rules is devised to define associations. Quantitative evaluation shows promising results; however, further work is required.
Extracted associations are stored in a graph database, enabling querying for associations reported in literature, as well as discovering new potential indirect linkages. To demonstrate its future use, a frontend proof of concept is presented.
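
As a rough illustration of what discovering an indirect linkage in an association graph can mean, the following minimal sketch finds entities that are not directly associated with a query entity but share an intermediate neighbour with it. It is not the implementation described in the thesis: a plain in-memory adjacency map stands in for the graph database, and all entity names and function names (`build_graph`, `indirect_links`) are hypothetical placeholders.

```python
from collections import defaultdict


def build_graph(associations):
    """Build an undirected adjacency map from (entity, entity) pairs."""
    graph = defaultdict(set)
    for a, b in associations:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def indirect_links(graph, source):
    """Map each two-hop entity to the intermediates that connect it to source.

    Entities already directly associated with source are excluded, so the
    result contains only candidate indirect linkages.
    """
    direct = graph.get(source, set())
    result = {}
    for mid in direct:
        for target in graph.get(mid, set()):
            if target != source and target not in direct:
                result.setdefault(target, set()).add(mid)
    return result


# Hypothetical associations, invented purely for illustration.
associations = [
    ("gene_A", "disease_X"),
    ("disease_X", "metabolite_M"),
    ("gene_A", "drug_D"),
    ("drug_D", "metabolite_M"),
    ("gene_A", "food_F"),
]

graph = build_graph(associations)
links = indirect_links(graph, "gene_A")
print(links)
```

In this toy data, `metabolite_M` is never directly associated with `gene_A`, but it is reachable through both `disease_X` and `drug_D`, so it is returned as a candidate indirect linkage together with the intermediates that support it.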
Supervisor: Veselkov, Kirill ; Takats, Zoltan
Sponsor: Imperial College London
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral