Use this URL to cite or link to this record in EThOS:
Title: Automatic structure and keyphrase analysis of scientific publications
Author: Constantin, Alexandru
ISNI:       0000 0004 5353 3142
Awarding Body: University of Manchester
Current Institution: University of Manchester
Date of Award: 2014
Availability of Full Text:
Access from EThOS:
Access from Institution:
Purpose. This work addresses an escalating problem within the realm of scientific publishing, that stems from accelerated publication rates of article formats difficult to process automatically. The amount of manual labour required to organise a comprehensive corpus of relevant literature has long been impractical. This has, in effect, reduced research efficiency and delayed scientific advancement. Two complementary approaches meant to alleviate this problem are detailed and improved upon beyond the current state-of-the-art, namely logical structure recovery of articles and keyphrase extraction. Methodology. The first approach targets the issue of flat-format publishing. It performs a structural analysis of the camera-ready PDF article and recognises its fine-grained organisation over logical units. The second approach is the application of a keyphrase extraction algorithm that relies on rhetorical information from the recovered structure to better contour an article’s true points of focus. A recount of the scientific article’s function, content and structure is provided, along with insights into how different logical components such as section headings or the bibliography can be automatically identified and utilised for higher-quality keyphrase extraction. Findings. Structure recovery can be carried out independently of an article’s formatting specifics, by exploiting conventional dependencies between logical components. In addition, access to an article’s logical structure is beneficial across term extraction approaches, reducing input noise and facilitating the emphasis of regions of interest. Value. The first part of this work details a novel method for recovering the rhetorical structure of scientific articles that is competitive with state-of-the-art machine learning techniques, yet requires no layout-specific tuning or prior training. The second part showcases a keyphrase extraction algorithm that outperforms other solutions in an established benchmark, yet does not rely on collection statistics or external knowledge sources in order to be proficient.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Text Mining ; Information Extraction ; Keyphrase Extraction ; Logical Structure Recovery