Use this URL to cite or link to this record in EThOS:
Title: Towards more human-like text summarization : story abstraction using discourse structure and semantic information
Author: Droog-Hayes, Maximilian
ISNI:       0000 0004 9355 3420
Awarding Body: Queen Mary University of London
Current Institution: Queen Mary, University of London
Date of Award: 2019
Availability of Full Text:
Access from EThOS:
Access from Institution:
With the massive amount of textual data being produced every day, the ability to effectively summarise text documents is becoming increasingly important. Automatic text summarization entails the selection and generalisation of the most salient points of a text in order to produce a summary. Approaches to automatic text summarization can fall into one of two categories: abstractive or extractive approaches. Extractive approaches involve the selection and concatenation of spans of text from a given document. Research in automatic text summarization began with extractive approaches, scoring and selecting sentences based on the frequency and proximity of words. In contrast, abstractive approaches are based on a process of interpretation, semantic representation, and generalisation. This is closer to the processes that psycholinguistics tells us that humans perform when reading, remembering and summarizing. However in the sixty years since its inception, the field has largely remained focused on extractive approaches. This thesis aims to answer the following questions. Does knowledge about the discourse structure of a text aid the recognition of summary-worthy content? If so, which specific aspects of discourse structure provide the greatest benefit? Can this structural information be used to produce abstractive summaries, and are these more informative than extractive summaries? To thoroughly examine these questions, they are each considered in isolation, and as a whole, on the basis of both manual and automatic annotations of texts. Manual annotations facilitate an investigation into the upper bounds of what can be achieved by the approach described in this thesis. Results based on automatic annotations show how this same approach is impacted by the current performance of imperfect preprocessing steps, and indicate its feasibility. Extractive approaches to summarization are intrinsically limited by the surface text of the input document, in terms of both content selection and summary generation. Beginning with a motivation for moving away from these commonly used methods of producing summaries, I set out my methodology for a more human-like approach to automatic summarization which examines the benefits of using discourse-structural information. The potential benefit of this is twofold: moving away from a reliance on the wording of a text in order to detect important content, and generating concise summaries that are independent of the input text. The importance of discourse structure to signal key textual material has previously been recognised, however it has seen little applied use in the field of autovii matic summarization. A consideration of evaluation metrics also features significantly in the proposed methodology. These play a role in both preprocessing steps and in the evaluation of the final summary product. I provide evidence which indicates a disparity between the performance of coreference resolution systems as indicated by their standard evaluation metrics, and their performance in extrinsic tasks. Additionally, I point out a range of problems for the most commonly used metric, ROUGE, and suggest that at present summary evaluation should not be automated. To illustrate the general solutions proposed to the questions raised in this thesis, I use Russian Folk Tales as an example domain. This genre of text has been studied in depth and, most importantly, it has a rich narrative structure that has been recorded in detail. The rules of this formalism are suitable for the narrative structure reasoning system presented as part of this thesis. The specific discourse-structural elements considered cover the narrative structure of a text, coreference information, and the story-roles fulfilled by different characters. The proposed narrative structure reasoning system produces highlevel interpretations of a text according to the rules of a given formalism. For the example domain of Russian Folktales, a system is implemented which constructs such interpretations of a tale according to an existing set of rules and restrictions. I discuss how this process of detecting narrative structure can be transferred to other genres, and a key factor in the success of this process: how constrained are the rules of the formalism. The system enumerates all possible interpretations according to a set of constraints, meaning a less restricted rule set leads to a greater number of interpretations. For the example domain, sentence level discourse-structural annotations are then used to predict summary-worthy content. The results of this study are analysed in three parts. First, I examine the relative utility of individual discourse features and provide a qualitative discussion of these results. Second, the predictive abilities of these features are compared when they are manually annotated to when they are annotated with varying degrees of automation. Third, these results are compared to the predictive capabilities of classic extractive algorithms. I show that discourse features can be used to more accurately predict summary-worthy content than classic extractive algorithms. This holds true for automatically obtained annotations, but with a much clearer difference when using manual annotations. The classifiers learned in the prediction of summary-worthy sentences are subsequently used to inform the production of both extractive and abstractive summaries to a given length. A human-based evaluation is used to compare these summaries, as well as the outputs of a classic extractive summarizer. I analyse the impact of knowledge about discourse structure, obtained both manually and automatically, on summary production. This allows for some insight into the knock on effects on summary production that can occur from inaccurate discourse information (narrative structure and coreference information). My analyses show that even given inaccurate discourse information, the resulting abstractive summaries are considered more informative than their extractive counterparts. With human-level knowledge about discourse structure, these results are even clearer. In conclusion, this research provides a framework which can be used to detect the narrative structure of a text, and shows its potential to provide a more human-like approach to automatic summarization. I show the limit of what is achievable with this approach both when manual annotations are obtainable, and when only automatic annotations are feasible. Nevertheless, this thesis supports the suggestion that the future of summarization lies with abstractive and not extractive techniques.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available