Title:
|
Discovering culturomic trends in large-scale textual corpora
|
The abundance of data and the ability to process it at a massive scale have
transformed many areas of research in the natural sciences. These data-driven
methods have recently begun to be adopted in other fields of research which
traditionally have not relied on computational approaches, such as the social
sciences and humanities. As we continue forward, we will likely see an increase
in the spread of data-driven approaches in these fields as more and more data
is "born digital", coupled with mass digitisation projects that aim to digitise
the mountains of paper archives that still exist.
In this thesis, we look at extracting, analysing and delving into data
from massive textual corpora, concentrating on macroscopic trends and
characteristics that can only be found when transitioning from traditional
social science methods involving manual inspection, known as 'coding', to
scalable, data-driven computational methods.
A distributed architecture for large-scale text analysis was collaboratively
developed during the project, serving as the infrastructure for collecting, storing
and analysing data. Using this infrastructure, this thesis not only explores
methods for extracting information in a scalable way but also demonstrates the
types of studies that can be achieved by adopting data-driven approaches.
These studies and their findings include differences in writing style across
topics and news outlets; longitudinal and diurnal patterns of mood change
in population-scale samples of UK social media users; and general tools and
methods that can be used to interrogate and explore massive textual corpora
in an interactive way.
We conclude that data-driven methods for the analysis of large-scale
textual corpora have now reached a point where the extraction of macroscopic
trends and patterns can enable meaningful information about the real world to
be discovered.
|