Use this URL to cite or link to this record in EThOS: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.744127
Title: Improving the performance and scalability of pattern subgraph queries
Author: Katsarou, Foteini
ISNI: 0000 0004 7232 5197
Awarding Body: University of Glasgow
Current Institution: University of Glasgow
Date of Award: 2018
Abstract:
Graphs have great representational power and can thus efficiently represent complex structures, such as chemical compounds and social networks. A common problem that arises on graphs is subgraph pattern matching: given a graph DB and a query in the form of a graph, return the graphs from the DB that contain the query. Some algorithms additionally return all occurrences of the query graph in the DB graphs. The subgraph matching problem entails subgraph isomorphism, which is known to be NP-complete. To alleviate the problem, a large number of methods have been proposed over the years, which can be classified into two major categories: (i) filter-then-verify (FTV) methods and (ii) subgraph isomorphism (SI) methods. Specifically, FTV methods rely on a constructed index to filter out graphs from the DB that definitely do not contain the query graph. On the remaining graphs, which form the so-called candidate set, a subgraph isomorphism algorithm is applied to verify whether the query graph is indeed contained in each DB graph. SI methods aim to optimize the subgraph isomorphism testing process itself by means of different heuristics. With our work, we confirm that both FTV and SI methods suffer from significant performance and scalability limitations, stemming from the NP-complete nature of the subgraph isomorphism problem. Instead of trying to devise new algorithms that outperform the existing ones, we take a different approach: we suggest a number of solutions that improve their performance and push back their scalability limits. In more detail, we conduct a comprehensive analysis of the state-of-the-art FTV methods.
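The filter-then-verify idea described above can be illustrated with a minimal sketch. It is not taken from the thesis: the graph encoding, the toy index feature (counts of edges by endpoint labels) and the names edge_features, passes_filter, contains and ftv_query are all assumptions made for illustration, and the verifier is a naive backtracking test rather than any of the SI algorithms studied.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Assumed encoding: (node -> label, set of undirected edges as node pairs).
Graph = Tuple[Dict[int, str], Set[Tuple[int, int]]]


def edge_features(g: Graph) -> Counter:
    """Count edges by the sorted pair of endpoint labels (a toy index feature)."""
    labels, edges = g
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)


def passes_filter(query_feats: Counter, graph_feats: Counter) -> bool:
    """A graph can contain the query only if it has at least as many of each feature."""
    return all(graph_feats[f] >= n for f, n in query_feats.items())


def contains(query: Graph, target: Graph) -> bool:
    """Naive backtracking test for (non-induced) subgraph isomorphism."""
    q_labels, q_edges = query
    t_labels, t_edges = target
    t_adj = {frozenset(e) for e in t_edges}
    q_nodes = list(q_labels)

    def extend(mapping: Dict[int, int]) -> bool:
        if len(mapping) == len(q_nodes):
            return True
        u = q_nodes[len(mapping)]
        for v in t_labels:
            if v in mapping.values() or t_labels[v] != q_labels[u]:
                continue
            # Each query edge from u to an already-mapped node must exist in the target.
            ok = all(
                frozenset((v, mapping[w])) in t_adj
                for a, b in q_edges
                for w in ((b,) if a == u else (a,) if b == u else ())
                if w in mapping
            )
            if ok:
                mapping[u] = v
                if extend(mapping):
                    return True
                del mapping[u]
        return False

    return extend({})


def ftv_query(query: Graph, db: List[Graph]) -> Tuple[List[int], List[int]]:
    """Return (candidate set, answer set) as lists of positions in db."""
    index = [edge_features(g) for g in db]  # a real system would build this once, offline
    q_feats = edge_features(query)
    candidates = [i for i, feats in enumerate(index) if passes_filter(q_feats, feats)]
    answers = [i for i in candidates if contains(query, db[i])]
    return candidates, answers
```

The filter never discards a graph that contains the query, but it may let through false positives; those are the redundant verification tests that a stronger index would avoid.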
We first identify a set of key parameters that influence the performance of these methods, namely the number of nodes and the density per graph, the number of distinct labels and of graphs in the graph DB, and the size of the query. Subsequently, using these parameters, we systematically perform a large number of experiments with both real and synthetic datasets, reporting on indexing time and size, query processing time, and filtering power. We analyze the sensitivity of the various FTV methods, and our analysis helps us draw useful conclusions about the algorithms' relative performance. In parallel, we stress-test them and thereby identify different scalability limits, i.e., points where some algorithms still operate while others break. One of the conclusions drawn from our experiments with the FTV methods is that query processing becomes harder as the graphs in the dataset grow in number of nodes and/or density and as the query size increases. Thus, we additionally bring into play the state-of-the-art SI methods and, along with the top-performing FTV methods identified by our analysis, investigate whether all queries of the same size are equally challenging. First, our experiments reveal that all proposed methods suffer from stragglers, i.e., queries whose execution times are many orders of magnitude worse than those of the majority. Second, our experiments show that isomorphic queries can have wildly different execution times on the various algorithms. Thus, we propose our own isomorphic query rewritings, which can introduce large performance gains. Third, we observe that stragglers are algorithm-specific, i.e., a straggler query on one algorithm can be a typical query on another. We incorporate our findings in a novel framework, coined the Psi-framework, that runs in parallel different isomorphic instances of the original query and/or different algorithms.
Such parallel executions of various algorithms have been used for other NP-hard problems and are known as portfolios of algorithms. Our framework introduces large performance gains in the subgraph matching problem, on both FTV and SI methods across all employed datasets, although, as with portfolios in general, some combinations of algorithms are more favorable than others. Recently proposed methods tend to dismiss FTV methods altogether and employ SI methods instead, claiming that SI methods enjoy shorter query execution times and that managing the index-based FTV methods is too costly. With our work, we investigate this claim. We first quantify the index constructed by state-of-the-art SI methods and by the top-performing FTV method in terms of time and size, and we evaluate how effective the constructed indices are at filtering out graphs that do not contain the query. Based on our experiments on both real and synthetic datasets, SI methods fail to avoid a large number of redundant subgraph isomorphism tests. Additionally, our experiments on the SI methods fail to indicate a single winner. Thus, we propose a hybrid FTV-SI method that combines the filtering achieved by the top-performing FTV method with the verification of various SI methods. Perhaps surprisingly for the problem at hand, this hybrid FTV-SI combination had not been studied before. Based on our experiments, such a hybrid combination brings high speedups in the subgraph matching problem. In an attempt to reduce the underlying indexing costs even further, we additionally experiment with different sizes of the enumerated features. Our experiments reveal that we can still achieve high-quality filtering even with smaller features, while the overall query execution time is still significantly improved.
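The portfolio idea behind the Psi-framework, running several algorithms and/or isomorphic rewritings of a query side by side and taking whichever finishes first, might be sketched as follows. This is only an illustration under stated assumptions: the function name race and the use of Python threads are hypothetical choices and are not described in this record, and the cancel_futures argument requires Python 3.9 or later.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


def race(variants: Dict[str, Callable[[], T]]) -> Tuple[str, T]:
    """Run all variants concurrently; return (name, result) of the first to finish.

    Each variant is assumed to be a zero-argument callable that answers the same
    query, e.g. one (algorithm, query rewriting) pair, so any result is valid.
    """
    pool = ThreadPoolExecutor(max_workers=len(variants))
    try:
        futures = {pool.submit(fn): name for name, fn in variants.items()}
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        winner = next(iter(done))
        return futures[winner], winner.result()
    finally:
        # Threads cannot be killed: queued work is cancelled, while variants that
        # are already running are simply left to finish in the background.
        pool.shutdown(wait=False, cancel_futures=True)
```

Because stragglers are algorithm-specific, a query that is slow for one variant is likely to be typical for another, so the first finisher bounds the overall latency. A production version would more plausibly use separate processes so that losing variants can be terminated.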
With our research results, we hope to open up a whole new research trend in which the community benefits from existing solutions by combining them appropriately to achieve large performance gains.
Supervisor: Not available
Sponsor: Not available
Qualification Name: Thesis (Ph.D.)
Qualification Level: Doctoral
EThOS ID: uk.bl.ethos.744127
DOI: Not available
Keywords: Q Science (General) ; QA75 Electronic computers. Computer science