Title: On the Helmholtz principle for text mining
Author: Dadachev, Boris
ISNI: 0000 0004 5371 0540
Awarding Body: Cardiff University
Current Institution: Cardiff University
Date of Award: 2015
The majority of text mining systems rely on bag-of-words approaches, representing textual documents as multisets of their constituent words. Combined with term weighting mechanisms, this simple representation makes it possible to derive features that can be used as input by many different algorithms and for a variety of applications, including document classification, information retrieval and sentiment analysis. Since the performance of many mining algorithms depends directly on term weights, techniques for quantifying term importance are central to text processing. This thesis takes advantage of recent advances in keyword extraction mechanisms, which go one step further and keep only the terms with the highest weights, that is, the most important words. More precisely, building on a recent keyword extraction technique, we develop novel text mining algorithms for information retrieval, text segmentation and summarization. We find that these algorithms provide state-of-the-art performance under standard evaluation techniques. However, contrary to many state-of-the-art algorithms, we try to make as few assumptions as possible about the data to be analyzed while maintaining good performance in terms of both speed and accuracy. As such, our algorithms can work with inputs from a variety of domains and languages, and they can also run in environments with limited resources. Additionally, in a field that tends to be dominated by empirical approaches, we strive to rely on sound and rigorous mathematical principles.
Supervisor: Not available Sponsor: Not available
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: QA75 Electronic computers. Computer science