Title: Numeracy of language models : joint modelling of words and numbers
Author: Spithourakis, Georgios P.
ISNI: 0000 0004 7965 1382
Awarding Body: UCL (University College London)
Current Institution: University College London (University of London)
Date of Award: 2019
Availability of Full Text: Full text unavailable from EThOS.
Numeracy and literacy are the abilities to understand and work with numbers and words, respectively. While both skills are necessary for reading and writing documents in clinical, scientific, and other technical domains, existing statistical language models focus on words at the expense of numbers: numbers are ignored, masked, or treated similarly to words, which can obscure numerical content and cause sparsity issues, e.g. high out-of-vocabulary rates. In this thesis, we investigate whether the performance of neural language models can be improved by i) considering numerical information as additional inputs and ii) explicitly modelling the output of numerical tokens. In experiments with numbers as input, we find that numerical input features improve perplexity by 33% on a clinical dataset. In assisted text entry and verification tasks, numerical input features improve recall from 25.03% to 71.28% for word prediction with a list of 5 suggestions, keystroke savings from 34.35% to 44.81% for word completion, and F1 by 5 points for semantic error correction. Numerical information from an accompanying knowledge base improves performance further. In experiments with numerical tokens as output, we consider different strategies, e.g. memorisation and digit-by-digit composition, and propose a novel neural component based on Gaussian mixture density estimation. We propose the use of regression metrics to evaluate numerical accuracy and an adjusted perplexity metric that accounts for the high out-of-vocabulary rate of numerals. Our evaluation on clinical and scientific datasets shows that perplexity can be improved by more than two and four orders of magnitude, respectively, by modelling words and numerals with different sub-models through a hierarchical softmax. On the same datasets, our proposed mixture of Gaussians model achieves 32% and 54% reductions in mean absolute percentage error over the competing strategy, digit-by-digit composition.
We conclude with a critical reflection on this thesis and suggestions for future work.
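As an illustration of two quantities named in the abstract, the following minimal Python sketch computes a one-dimensional Gaussian mixture density and a percentage-error regression metric, taken here to be the standard mean absolute percentage error. It uses textbook definitions only and is not the thesis's implementation; the function names `mixture_density`, `mixture_mean`, and `mape`, and all example values, are hypothetical.

```python
import math


def mixture_density(x, weights, means, stds):
    """Density at x of a 1-D Gaussian mixture (weights are assumed to sum to 1)."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def mixture_mean(weights, means):
    """Expected value of the mixture, usable as a point prediction for a numeral."""
    return sum(w * m for w, m in zip(weights, means))


def mape(targets, predictions):
    """Mean absolute percentage error, in percent; every target must be non-zero."""
    errors = [abs((t - p) / t) for t, p in zip(targets, predictions)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical example: two components, e.g. a "small values" and a "large values" mode.
weights, means, stds = [0.5, 0.5], [1.0, 3.0], [0.5, 0.5]
point_prediction = mixture_mean(weights, means)
density_at_one = mixture_density(1.0, weights, means, stds)
error = mape([100.0, 200.0], [110.0, 180.0])
```

A regression metric of this kind scores how close a predicted number is to the target, which perplexity over tokens alone does not capture.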
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available