Title:
|
Query answering in distributed RDF databases
|
To simplify data integration and exchange, modern applications often represent their data using the Resource Description Framework (RDF). As the amount of available data keeps increasing, many RDF datasets cannot be processed using centralised RDF stores. A common solution is to distribute RDF data across a cluster of shared-nothing servers, and to query the data using a distributed query algorithm. Existing approaches typically use a variant of the data exchange operator to shuffle partial query answers between servers and thus ensure that every query answer is produced. Decisions as to when and where to shuffle the data are usually made statically, that is, at query compile time. In this thesis, we argue that such approaches can miss opportunities for local computation and thus incur considerable overheads. To address this, we present a novel distributed query evaluation algorithm for RDF based on dynamic data exchange, where all computation that can be done locally is guaranteed to be performed on a single server. Our approach can successfully process any query even if the memory available at each server is bounded, and we argue that this is critical in distributed systems where intermediate results can easily exceed the capacity of each server. We also present a new query planning approach that balances the cost of communication against the cost of local processing at each server, as well as a new approach to partitioning RDF data that aims to increase locality at each server. We have implemented our approach in the well-known RDFox data store, and our empirical evaluation suggests that our techniques can outperform the state of the art by orders of magnitude in terms of query evaluation times, network communication, and memory use.
|