Use this URL to cite or link to this record in EThOS:
Title: Representation learning beyond semantic similarity : character-aware and function-specific approaches
Author: Gerz, Daniela Susanne
ISNI:       0000 0004 9348 0997
Awarding Body: University of Cambridge
Current Institution: University of Cambridge
Date of Award: 2020
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
Representation learning is a research area within machine learning and natural language processing (NLP) concerned with building machine-understandable representations of discrete units of text. Continuous representations are at the core of modern machine learning applications, and representation learning has thereby become one of the central research areas in NLP. The induction of text representations is typically based on the distributional hypothesis, and consequently encodes general information about word similarity. Words or phrases with similar meaning obtain similar representations in a vector space constructed for this purpose. This established methodology excels for morphologically-simple languages such as English, and in data-rich settings. However, several useful lexical relations such as entailment or selectional preference, are not captured or get conflated with other relations. Another challenge is dealing with low-data regimes for morphologically-complex and under-resourced languages. In this thesis we construct novel representation learning methods that go beyond the limitations of the distributional hypothesis and investigate solutions that induce vector spaces with diverse properties. In particular, we look at how the vector space induction process influences the contained information, and how the information manifests in a number of core NLP tasks: semantic similarity, lexical entailment, selectional preference, and language modeling. We contribute novel evaluations of state-of-the-art models highlighting their current capabilities and limitations. An analysis of language modeling in 50 typologically-diverse languages demonstrates that representations can indeed pose a performance bottleneck. We introduce a novel approach to leveraging subword-level information in word representations: our solution lifts this bottleneck in low-resource scenarios. Finally, we introduce a novel paradigm of function-specific representation learning that aims to integrate fine-grained semantic relations and real-world knowledge into the word vector spaces. We hope this thesis can serve as a valuable overview on word representations, and inspire future work in modeling 'semantic similarity and beyond'.
Supervisor: Korhonen, Anna Sponsor: ERC
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
Keywords: multilingual ; representation learning ; word vector spaces