Use this URL to cite or link to this record in EThOS:
Title: Explaining deep neural networks
Author: Camburu, Oana-Maria
ISNI:       0000 0004 9355 6234
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2020
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
Deep neural networks are becoming more and more popular due to their revolutionary success in diverse areas, such as computer vision, natural language processing, and speech recognition. However, the decision-making processes of these models are generally not interpretable to users. In various domains, such as healthcare, finance, or law, it is critical to know the reasons behind a decision made by an artificial intelligence system. Therefore, several directions for explaining neural models have recently been explored. In this thesis, I investigate two major directions for explaining deep neural networks. The first direction consists of feature-based post-hoc explanatory methods, that is, methods that aim to explain an already trained and fixed model (post-hoc), and that provide explanations in terms of input features, such as tokens for text and superpixels for images (feature-based). The second direction consists of self-explanatory neural models that generate natural language explanations, that is, models that have a built-in module that generates explanations for the predictions of the model. The contributions in these directions are as follows. First, I reveal certain difficulties of explaining even trivial models using only input features. I show that, despite the apparent implicit assumption that explanatory methods should look for one specific ground-truth feature-based explanation, there is often more than one such explanation for a prediction. I also show that two prevalent classes of explanatory methods target different types of ground-truth explanations without explicitly mentioning it. Moreover, I show that, sometimes, neither of these explanations is enough to provide a complete view of a decision-making process on an instance. Second, I introduce a framework for automatically verifying the faithfulness with which feature-based post-hoc explanatory methods describe the decision-making processes of the models that they aim to explain. This framework relies on the use of a particular type of model that is expected to provide insight into its decision-making process. I analyse potential limitations of this approach and introduce ways to alleviate them. The introduced verification framework is generic and can be instantiated on different tasks and domains to provide off-the-shelf sanity tests that can be used to test feature-based post-hoc explanatory methods. I instantiate this framework on a task of sentiment analysis and provide sanity tests on which I present the performances of three popular explanatory methods. Third, to explore the direction of self-explanatory neural models that generate natural language explanations for their predictions, I collected a large dataset of ~570K human-written natural language explanations on top of the influential Stanford Natural Language Inference (SNLI) dataset. I call this explanation-augmented dataset e-SNLI. I do a series of experiments that investigate both the capabilities of neural models to generate correct natural language explanations at test time, and the benefits of providing natural language explanations at training time. Fourth, I show that current self-explanatory models that generate natural language explanations for their own predictions may generate inconsistent explanations, such as ``There is a dog in the image.'' and ``There is no dog in the [same] image.''. Inconsistent explanations reveal either that the explanations are not faithfully describing the decision-making process of the model or that the model learned a flawed decision-making process. I introduce a simple yet effective adversarial framework for sanity checking models against the generation of inconsistent natural language explanations. Moreover, as part of the framework, I address the problem of adversarial attacks with exact target sequences, a scenario that was not previously addressed in sequence-to-sequence attacks, and which can be useful for other tasks in natural language processing. I apply the framework on a state of the art neural model on e-SNLI and show that this model can generate a significant number of inconsistencies. This work paves the way for obtaining more robust neural models accompanied by faithful explanations for their predictions.
Supervisor: Gyrd-Hansen, Mads ; Nijman, Sebastian ; Brennan, Paul Edward Sponsor: JP Morgan PhD Fellowship ; Engineering and Physical Sciences Research Council
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: natural language processing ; explainability ; deep learning