Accelerating data retrieval steps in XML documents
The aim of this research is to accelerate the data retrieval steps in a collection
of XML (eXtensible Markup Language) documents, a key task of current XML
research. The following three interconnected issues in state-of-the-art
XML research are therefore studied: semantically clustering XML documents,
efficiently querying XML documents with an index structure, and self-adaptively
labelling dynamic XML documents. Together, these form a basic but self-contained
foundation for a native XML database system.
This research is carried out by following a divide-and-conquer strategy. The
issue of dividing a collection of XML documents into sub-clusters, in which semantically
similar XML documents are grouped together, is addressed first.
To this end, a semantic component model is proposed to capture the implicit
semantics of an XML document. This model enables us to devise a set of
heuristic algorithms to compute the degree of similarity among XML documents.
In particular, the newly proposed semantic component model and the heuristic
algorithms reflect the inaccuracy of the traditional edit-distance-based clustering
mechanisms. After similar XML documents are grouped into sub-collections,
the problem of querying XML documents with an index structure is carefully
studied. A novel geometric sequence model is proposed to transform XML documents
into numbered geometric sequences and XPath queries into geometric
query sequences. The problem of evaluating an XPath query in an XML document
is proved to be equivalent to the problem of finding the subsequence
matchings of a geometric query sequence in a numbered geometric document sequence.
This geometric sequence model then enables us to devise two new stack-based
algorithms to perform both top-down and bottom-up XPath evaluation in
XML documents. In particular, the algorithms treat an XPath query as a whole
unit, avoiding resource-consuming join operations and generating all the answers
without semantic errors or false alarms. Finally, the issue of supporting update
functions in XML documents is tackled. A new Bayesian allocation model is introduced
for the index structure generated by the geometric sequence model. Based
on a k-ary tree data structure and the level traversal mechanism, the correctness
and efficiency of the Bayesian allocation model in supporting dynamic XML documents
are theoretically proved. In particular, the Bayesian allocation model is
general and can be applied to most current index structures.
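
As background for the k-ary tree and level traversal mentioned above, the following is a minimal sketch of the standard level-order numbering of a complete k-ary tree. It is a generic illustration only, not the Bayesian allocation model itself, whose details are not given here. The function names (parent, children, level, is_ancestor) and the convention that the root has index 0 are assumptions made for this example.

```python
# Generic level-order (breadth-first) numbering of a complete k-ary tree.
# Assumption for illustration: the root has index 0 and every node has
# exactly k child slots. This is NOT the allocation model from the text.

def parent(i, k):
    """Return the level-order index of the parent of node i, or None for the root."""
    if i == 0:
        return None
    return (i - 1) // k


def children(i, k):
    """Return the level-order indices of the k child slots of node i."""
    return list(range(k * i + 1, k * i + k + 1))


def level(i, k):
    """Return the depth of node i (the root is at depth 0)."""
    depth = 0
    while i > 0:
        i = (i - 1) // k
        depth += 1
    return depth


def is_ancestor(a, d, k):
    """Return True if node a is a proper ancestor of node d."""
    if a >= d:
        return False
    # A parent's index is always smaller than its child's index,
    # so walk upward from d until the index is no larger than a.
    while d > a:
        d = (d - 1) // k
    return d == a
```

With such a numbering, structural relationships between nodes can be decided from the labels alone, without traversing the document, which is the general property that makes numeric labelling schemes attractive for indexing.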