Title: A schema exploration approach for document-oriented data using unsupervised techniques
Author: Bawakid, Fahad
ISNI: 0000 0004 8502 0173
Awarding Body: University of Southampton
Current Institution: University of Southampton
Date of Award: 2019
Availability of Full Text: Full text unavailable from EThOS.
For more than 40 years, relational data was the dominant force in data storage and management. However, the complex and strict relational data model has started to lose ground, especially where the data requirements of applications change frequently and therefore call for a flexible data model. Contemporary applications favour less strict data models that can be described as semi-structured or unstructured, and these kinds of data are gaining popularity among database developers. For instance, the amount of document-oriented data available on the Web keeps rising, and a large portion of it comes from Web APIs. Although document-oriented data is widely used, it cannot be easily consumed or analysed if it lacks a description (i.e., metadata) or schemas that explain its internal structure. Schemas make datasets more understandable and easier to query and analyse. Our literature review found that current initiatives and tools, those available for JSON documents in particular, do not provide comprehensive summaries that explore the internal structure of the documents. Our research builds on current work on JSON schema inference by addressing specific research gaps related to extracting accurate and explicit summaries of the internal structure of JSON documents. This research aims to provide structural summaries that help in understanding JSON documents by explicitly extracting schemas and analysing their attributes. Our approach first infers all attributes from a JSON dataset, then applies a clustering algorithm to the documents in order to identify unique schemas; finally, based on the schemas resulting from the clustering process, we apply statistical analysis to the attributes to generate useful summaries by detecting common and schema-specific attributes. The approach is validated and evaluated on real-world and synthetic data collected from the Web in different domains.
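The three steps named in the abstract (attribute inference, clustering, attribute analysis) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not name the clustering algorithm or the statistics used, so the greedy Jaccard-threshold clustering, the 0.5 threshold, the dotted-path notation, the sample documents, and every function name below are assumptions chosen only for illustration.

```python
import json
from collections import Counter


def attribute_paths(value, prefix=""):
    """Return the set of dotted attribute paths found in a parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(child, path)
    elif isinstance(value, list):
        # Array elements share one path, marked with "[]" (an assumed notation).
        for item in value:
            paths |= attribute_paths(item, prefix + "[]")
    return paths


def jaccard(a, b):
    """Jaccard similarity of two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_documents(attribute_sets, threshold=0.5):
    """Greedy single-pass clustering (a stand-in, not the thesis's algorithm).

    Each document joins the first cluster whose representative, the attribute
    set of that cluster's first member, is at least `threshold` similar.
    """
    clusters = []  # list of (representative attribute set, member indices)
    for index, attrs in enumerate(attribute_sets):
        for representative, members in clusters:
            if jaccard(attrs, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((attrs, [index]))
    return [members for _, members in clusters]


def summarise(attribute_sets, clusters):
    """Split attributes into those common to all clusters and those unique to one."""
    cluster_attrs = [
        set().union(*(attribute_sets[i] for i in members)) for members in clusters
    ]
    common = set.intersection(*cluster_attrs) if cluster_attrs else set()
    counts = Counter(attr for attrs in cluster_attrs for attr in attrs)
    specific = [{a for a in attrs if counts[a] == 1} for attrs in cluster_attrs]
    return common, specific


# Hypothetical sample data: two book-like and two product-like documents.
raw = """[
  {"id": 1, "title": "A", "author": {"name": "X"}},
  {"id": 2, "title": "B", "author": {"name": "Y"}, "year": 2019},
  {"id": 3, "price": 9.5, "stock": 4},
  {"id": 4, "price": 1.0, "stock": 0, "tags": ["new"]}
]"""
docs = json.loads(raw)
attribute_sets = [attribute_paths(doc) for doc in docs]
clusters = cluster_documents(attribute_sets)
common, specific = summarise(attribute_sets, clusters)
print(clusters)
print(sorted(common))
print([sorted(s) for s in specific])
```

On this sample the documents fall into two groups, with `id` as the only attribute shared by both. A real dataset would likely need a more robust clustering method and per-attribute frequency statistics within each cluster.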
Supervisor: Hall, Wendy
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID:
DOI: Not available