As your SPARQL queries grow, you may find that parts of the query are duplicated or that performance suffers. Federated queries are subject to additional constraints. The SPARQL Named Query proposal allows the explicit reuse of sub-queries. This blog post describes the problem in more detail, shows how SPARQL Named Query can solve it, and explains how you can already try it.
Let’s look at the problem using an example: we want to find all movies whose narrative or filming location is the capital of Bavaria. First, we search for the capital of Bavaria:
?city wdt:P31 wd:Q515; # city
Then we combine two sub-queries with a UNION to get the movies:
{
And that’s our complete query:
PREFIX bd: <http://www.bigdata.com/rdf#>
According to the SPARQL specification, the UNION sub-queries would be processed first, and the result would then be joined with the part that identifies the capital of Bavaria. A query optimizer may run it the other way round, which would speed up the UNION sub-queries, but you can’t rely on that. Another option would be to place the pattern for the capital explicitly inside both UNION sub-queries, at the cost of duplicating it:
PREFIX bd: <http://www.bigdata.com/rdf#>
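To make the effect of the evaluation order more tangible, here is a small Python sketch. It is not a SPARQL engine, only a toy model under my own assumptions: the triples, the `CAPITALS` set, and both function names are made up for illustration. It counts how many intermediate rows each strategy has to handle before the final result is known.

```python
# Toy data: (movie, property, city) triples. All identifiers are made up.
TRIPLES = [
    ("movie1", "narrative_location", "munich"),
    ("movie2", "filming_location", "munich"),
    ("movie3", "narrative_location", "berlin"),
    ("movie4", "filming_location", "hamburg"),
    ("movie5", "filming_location", "berlin"),
]

# Stand-in for the result of the "capital of Bavaria" pattern.
CAPITALS = {"munich"}


def union_then_join(triples, capitals):
    """Evaluate both UNION branches in full, then join with the capital."""
    narrative = [(m, c) for m, p, c in triples if p == "narrative_location"]
    filming = [(m, c) for m, p, c in triples if p == "filming_location"]
    intermediate = narrative + filming
    movies = sorted({m for m, c in intermediate if c in capitals})
    return movies, len(intermediate)


def restrict_inside_union(triples, capitals):
    """Apply the capital restriction inside each UNION branch."""
    narrative = [(m, c) for m, p, c in triples
                 if p == "narrative_location" and c in capitals]
    filming = [(m, c) for m, p, c in triples
               if p == "filming_location" and c in capitals]
    intermediate = narrative + filming
    movies = sorted({m for m, c in intermediate})
    return movies, len(intermediate)


if __name__ == "__main__":
    print(union_then_join(TRIPLES, CAPITALS))        # 5 intermediate rows
    print(restrict_inside_union(TRIPLES, CAPITALS))  # 2 intermediate rows
```

Both strategies return the same movies, but the second one never materializes the rows for other cities. On a real dataset that difference is what decides whether the query is fast or slow.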
SPARQL Named Query lets you define the query once and import the result into the UNION sub-queries with VALUES FROM. The final query would look like this:
PREFIX bd: <http://www.bigdata.com/rdf#>
You may say that you don’t need it because you only work on SPARQL endpoints with a very well-tweaked query optimizer. But with federated queries, the story gets more complicated. The specification mentions that “an implementation of a query planner for federated queries may decide to decompose the query into two queries instead”, but also “Many existing SPARQL endpoints have restrictions in the number of results they return and may miss the ones matching”. So the result can differ depending on the behavior of the query planner. With SPARQL Named Query, it’s possible to enforce that the result set is reduced on the remote endpoint, which decreases the risk of wrong results caused by a result limit.
Here is an example where data from Wikidata and WikiPathways is combined. Pathways whose label contains the string “vitamin” are identified on the Wikidata side. From WikiPathways, annotations are fetched and joined:
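The result-limit problem can be simulated in a few lines of Python. This is only a sketch under my own assumptions: `REMOTE_ROWS`, `RESULT_LIMIT`, `LOCAL_MATCHES`, and the function names are hypothetical, and a real endpoint may truncate its results in a different order.

```python
# Hypothetical remote data: ten (pathway, annotation) rows.
REMOTE_ROWS = [(f"pathway{i}", f"annotation{i}") for i in range(10)]

# Hypothetical cap on the number of rows the remote endpoint returns.
RESULT_LIMIT = 5

# Pathways that were identified on the local side.
LOCAL_MATCHES = {"pathway2", "pathway8"}


def remote_query(rows, limit, only=None):
    """Return at most `limit` rows; `only` restricts pathways before the cap."""
    if only is not None:
        rows = [row for row in rows if row[0] in only]
    return rows[:limit]


def join_locally():
    """Fetch unrestricted (and therefore truncated) rows, then join locally."""
    fetched = remote_query(REMOTE_ROWS, RESULT_LIMIT)
    return [row for row in fetched if row[0] in LOCAL_MATCHES]


def restrict_remotely():
    """Send the local matches along so the remote side filters first."""
    return remote_query(REMOTE_ROWS, RESULT_LIMIT, only=LOCAL_MATCHES)


if __name__ == "__main__":
    print(join_locally())       # pathway8 is lost to the result limit
    print(restrict_remotely())  # both pathways are returned
```

In the first variant the row for `pathway8` is cut off before the join ever sees it, and nothing in the result tells you that something is missing.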
PREFIX dc: <http://purl.org/dc/elements/1.1/>
And the same query with SPARQL Named Query:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
Run the query on the SPARQL Named Query Web application
One should be careful with performance comparisons on a public endpoint, but I got very consistent results:
- standard query: ~5s
- query with SPARQL Named Query: ~1.5s
I don’t have access to the intermediate query and result, but I guess the standard query needs the additional time to process more results on the remote endpoint and to hand them over to the local endpoint.
If you follow the links of the federated query example, you have already stumbled over the Web application, which does a client-side query translation and processing. You can find the code in the sparql-named-query repository. It also includes a command line tool.
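To give an idea of what a client-side translation could involve, here is a minimal Python sketch. It is not the code from the repository, and I am assuming one plausible approach: run the named sub-query first, then inline its bindings into the main query as a standard SPARQL 1.1 VALUES block. The function names `format_term` and `values_clause` are made up; the binding structure follows the SPARQL 1.1 Query Results JSON Format.

```python
def format_term(binding):
    """Render one binding from SPARQL JSON results as a SPARQL term."""
    if binding is None:
        return "UNDEF"  # unbound variable in this row
    kind = binding["type"]
    value = binding["value"]
    if kind == "uri":
        return f"<{value}>"
    if kind == "literal":
        escaped = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n")
                        .replace("\r", "\\r"))
        if "xml:lang" in binding:
            return f'"{escaped}"@{binding["xml:lang"]}'
        if "datatype" in binding:
            return f'"{escaped}"^^<{binding["datatype"]}>'
        return f'"{escaped}"'
    # Blank nodes are not allowed inside a VALUES block.
    raise ValueError(f"cannot inline term of type {kind!r} in VALUES")


def values_clause(variables, bindings):
    """Build a VALUES block from variable names and result bindings."""
    header = " ".join(f"?{name}" for name in variables)
    rows = [
        "(" + " ".join(format_term(row.get(name)) for name in variables) + ")"
        for row in bindings
    ]
    return f"VALUES ({header}) {{ " + " ".join(rows) + " }"


if __name__ == "__main__":
    # Hypothetical result of a named sub-query with a single variable.
    result = [{"city": {"type": "uri", "value": "http://example.org/city/1"}}]
    print(values_clause(["city"], result))
```

The generated block could then be placed wherever the main query imports the named result, so the restriction travels with the sub-query instead of being left to the query planner.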
This post and the code cover only SELECT queries. The concept could be extended to create graphs on the fly with CONSTRUCT and DESCRIBE queries. The result could be accessed anywhere a Named Graph is addressed.