Use this URL to cite or link to this record in EThOS:
Title: Automated analysis and validation of chemical literature
Author: Townsend, Joseph A.
ISNI:       0000 0004 2707 5678
Awarding Body: University of Cambridge
Current Institution: University of Cambridge
Date of Award: 2008
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
Methods to automatically extract and validate data from the chemical literature in legacy formats to machine-understandable forms are examined. The work focuses of three types of data: analytical data reported in articles, computational chemistry output files and crystallographic information files (CIFs). It is shown that machines are capable of reading and extracting analytical data from the current legacy formats with high recall and precision. Regular expressions cannot identify chemical names with high precision or recall but non-deterministic methods perform significantly better. The lack of machine-understandable connection tables in the literature has been identified as the major issue preventing molecule-based data-driven science being performed in the area. The extraction of data from computational chemistry output files using parser-like approaches is shown to be not generally possible although such methods work well for input files. A hierarchical regular expression based approach can parse > 99.9% of the output files correctly although significant human input is required to prepare the templates. CIFs may be parsed with extremely high recall and precision, contain connection tables and the data is of high quality. The comparison of bond lengths calculated by two computational chemistry programs show good agreement in general but structures containing specific moieties cause discrepancies. An initial protocol for the high-throughput geometry optimisation of molecules extracted from the CIFs is presented and the refinement of this protocol is discussed. Differences in bond length between calculated and experimentally determined values from the CIFs of less than 0.03 Angstrom are shown to be expected by random error. The final protocol is used to find high-quality structures from crystallography which can be reused for further science.
Supervisor: Not available Sponsor: Unilever Centre for Molecular Science Informatics
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
Keywords: MOPAC ; GAMESS ; CIF ; Calculation ; OSCAR ; Regular Expression