Use this URL to cite or link to this record in EThOS:
Title: Quantifying textual similarities across scientific research communities
Author: Duffee, Boyd
ISNI:       0000 0004 7427 8136
Awarding Body: Keele University
Current Institution: Keele University
Date of Award: 2018
Availability of Full Text:
Access from EThOS:
Access from Institution:
There are well-established approaches of text mining collections of documents and for understanding the network of citations between academic papers. Few studies have examined the textual content of the papers that constitute a citation network. A document corpus was obtained from the arXiv repository, selected from papers relating to the subject of Dark Matter and a citation network was created from the data held by NASA’s Astrophysics Data System on those papers, their citations and references. I use the Louvain community-finding algorithm on the Dark Matter network to identify groups of papers with a higher density of citations and compare the textual similarity between papers in the Dark Matter corpus using the Vector Space Model of document representation and the cosine similarity function. It was found that pairs of papers within a citation community have a higher similarity than they do with papers in other citation communities. This implies that content is associated with structure in scientific citation networks, which opens avenues for research on network communities for finding ground-truth using advanced Text Mining techniques, such as Topic Modelling. It was found that using the titles of papers in a citation network community was a good method for identifying the community. The power law exponent of the degree distribution was found to be, = 2.3, lower than results reported for other citation networks. The selection of papers based on a single subject, rather than based on a journal or category, is suggested as the reason for this lower value. It was also found that the degree pair correlation of the citation network classifies it as a disassortative network with a cut-off value at degree kc = 30.The textual similarity of documents decreases linearly with age over a 15 year timespan.
Supervisor: Rugg, Gordon Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: QA75 Electronic computers. Computer science