Use this URL to cite or link to this record in EThOS:
Title: A framework for processing correlated probabilistic data
Author: van Schaik, Sebastiaan Johannes
ISNI:       0000 0004 5915 6237
Awarding Body: University of Oxford
Current Institution: University of Oxford
Date of Award: 2014
Availability of Full Text:
Access from EThOS:
Full text unavailable from EThOS. Please try the link below.
Access from Institution:
The amount of digitally-born data has surged in recent years. In many scenarios, this data is inherently uncertain (or: probabilistic), such as data originating from sensor networks, image and voice recognition, location detection, and automated web data extraction. Probabilistic data requires novel and different approaches to data mining and analysis, which explicitly account for the uncertainty and the correlations therein. This thesis introduces ENFrame: a framework for processing and mining correlated probabilistic data. Using this framework, it is possible to express both traditional and novel algorithms for data analysis in a special user language, without having to explicitly address the uncertainty of the data on which the algorithms operate. The framework will subsequently execute the algorithm on the probabilistic input, and perform exact or approximate parallel probability computation. During the probability computation, correlations and provenance are succinctly encoded using probabilistic events. This thesis contains novel contributions in several directions. An expressive user language – a subset of Python – is introduced, which allows a programmer to implement algorithms for probabilistic data without requiring knowledge of the underlying probabilistic model. Furthermore, an event language is presented, which is used for the probabilistic interpretation of the user program. The event language can succinctly encode arbitrary correlations using events, which are the probabilistic counterparts of deterministic user program variables. These highly interconnected events are stored in an event network, a probabilistic interpretation of the original user program. Multiple techniques for exact and approximate probability computation (with error guarantees) of such event networks are presented, as well as techniques for parallel computation. Adaptations of multiple existing data mining algorithms are shown to work in the framework, and are subsequently subjected to an extensive experimental evaluation. Additionally, a use-case is presented in which a probabilistic adaptation of a clustering algorithm is used to predict faults in energy distribution networks. Lastly, this thesis presents techniques for integrating a number of different probabilistic data formalisms for use in this framework and in other applications.
Supervisor: Olteanu, Dan Sponsor: EU-FP7 ; Engineering and Physical Sciences Research Council ; Dutch Prins Bernhard Cultuurfondsbeurs ; De Breed Kreiken Innovatiefonds
Qualification Name: Thesis (Ph.D.) Qualification Level: Doctoral
EThOS ID:  DOI: Not available
Keywords: Computer science (mathematics) ; Applications and algorithms ; Software engineering ; data mining ; probabilistic data ; uncertain data ; ENFrame ; probability computation ; clustering ; classification ; correlations ; machine learning