Title: Automating the construction of higher order data representations from heterogeneous biodiversity datasets
Author: Nicolson, Nicky
ISNI: 0000 0004 8500 3365
Awarding Body: Brunel University London
Current Institution: Brunel University
Date of Award: 2019
Datasets created from large-scale specimen digitisation drive biodiversity research, but these are often heterogeneous: incomplete and fragmented. As aggregated data volumes increase, there have been calls to develop a "biodiversity knowledge graph" to better interconnect the data and support meta-analysis, particularly relating to the process of species description. This work maps data concepts and inter-relationships, and aims to develop automated approaches to detect the entities required to support these kinds of meta-analyses. An example is given using trends analysis on name publication events and their authors, which shows that despite implementation and widespread adoption of major changes to the process by which authors can publish new scientific names for plants, the data show no difference in the rates of publication. A novel data-mining process based on unsupervised learning is described, which detects specimen collectors and events preparatory to species description, allowing a larger set of data to be used in trends analysis. Record linkage techniques are applied to these two datasets to integrate data on authors and collectors to create a generalised agent entity, assessing specialisation and classifying working practices into separate categories. Recognising the role of agents (collectors, authors) in the processes (collection, publication) contributing to the recognition of new species, it is shown that features derived from data-mined aggregations can be used to build a classification model to predict which agent-initiated units of work are particularly valuable for species discovery. Finally, shared collector entities are used to integrate distributed specimen products of a single collection event across institutional boundaries, maximising impact of expert annotations. 
An inferred network of relationships between institutions based on specimen sharing relationships allows community analysis and the definition of optimal co-working relationships for efficient specimen digitisation and curation.
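The last two steps of the abstract (linking specimens from one collection event across institutions, then inferring an institution network from that sharing) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes, purely for illustration, that a collection event can be approximated by an exact match on collector name and collector number, and every record, institution code, and function name below is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical specimen records: (institution, collector, collector number).
# All values are invented for illustration only.
records = [
    ("INST_A", "Smith, J.", "101"),
    ("INST_B", "Smith, J.", "101"),
    ("INST_C", "Smith, J.", "101"),
    ("INST_A", "Smith, J.", "102"),
    ("INST_B", "Smith, J.", "102"),
    ("INST_C", "Jones, A.", "7"),
]


def institution_network(records):
    """Count, for each pair of institutions, the collection events they share."""
    holders = defaultdict(set)
    for institution, collector, number in records:
        # Assumption: (collector, number) identifies one collection event.
        holders[(collector, number)].add(institution)

    edges = Counter()
    for institutions in holders.values():
        # Every pair of institutions holding the same event gains one edge weight.
        for pair in combinations(sorted(institutions), 2):
            edges[pair] += 1
    return edges


network = institution_network(records)
for (left, right), weight in sorted(network.items()):
    print(left, right, weight)
```

Real collector strings vary in spelling and format, so exact matching would miss many shared events; the abstract indicates that data-mined collector entities and record linkage are used for that step instead.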
Supervisor: Tucker, A.
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Machine learning ; Biodiversity informatics ; Specimen digitisation ; Clustering ; Record linkage