One of my projects ran into a performance problem with the available SHACL engines. I also needed to know which triples had been processed (coverage/fragments), so I implemented a new SHACL engine in JavaScript from scratch.
Performance issue
I collect a lot of data from my IoT devices at home.
Everything is mapped into RDF and stored on a triplestore.
Since sensors sometimes generate garbage data, I decided to use SHACL shapes to ensure data quality.
Everything worked fine on my PC, but I noticed a performance issue on my Raspberry Pi, where I’m running some other services.
I compared the performance of the rdf-validate-shacl JavaScript library with the pySHACL Python library and got similar results.
Profiling and analysis
After taking a closer look at the rdf-validate-shacl library with a profiler, I found many calls to Dataset.match. Going through the code made clear what was going on: each time a shape was processed, all its properties and values were fetched from the Dataset object. And that happens quite often for the same shape.
The library is based on the reference implementation, and for this case, it makes sense to use the most straightforward solution to match the logic of the code with the specification.
However, I wanted a production-ready library.
Implementation
Production-ready doesn’t mean readability is ignored and the shortest path to better performance is taken. I looked for a structure that fits the library best and decided to implement an engine that creates shapes in a compile step.
First, the engine goes through all properties of a shape, finds the matching compile function, calls it, and adds the returned function to the shape. The compile function reads all required arguments from the dataset and, if necessary, preprocesses the data. It returns a new function that binds all the arguments and no longer requires a shape context.
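The compile step can be sketched roughly like this. All names and the shape representation below are illustrative, not the actual shacl-engine internals:

```javascript
// Registry of compile functions, keyed by constraint property
// (illustrative sketch, not the real shacl-engine code).
const compilers = new Map([
  ['maxCount', shape => {
    // read and preprocess the argument once, at compile time
    const maxCount = parseInt(shape.properties.get('maxCount'), 10)
    // the returned function binds maxCount and needs no shape context
    return values => values.length <= maxCount
  }],
  ['minCount', shape => {
    const minCount = parseInt(shape.properties.get('minCount'), 10)
    return values => values.length >= minCount
  }]
])

// Compile a shape: for each property, find the matching compile function,
// call it, and collect the returned validation functions.
function compileShape (shape) {
  const validations = []
  for (const property of shape.properties.keys()) {
    const compile = compilers.get(property)
    if (compile) validations.push(compile(shape))
  }
  return validations
}
```

A compiled shape is then just a list of functions that can be applied to a set of values without touching the dataset again.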
Generic properties like message and severity are implemented as getter functions. The first time a getter is called, the value is fetched from the dataset; subsequent calls return a cached value.
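Such a lazily cached getter could look like the sketch below; the Shape class and the dataset lookup are illustrative stand-ins for the real engine:

```javascript
// Sketch of a lazily cached getter (illustrative names, not the
// actual shacl-engine code).
class Shape {
  constructor (dataset) {
    this.dataset = dataset
    this._message = undefined
  }

  get message () {
    if (this._message === undefined) {
      // first call: fetch the value from the dataset and cache it
      this._message = this.dataset.fetch('message')
    }
    // later calls return the cached value without touching the dataset
    return this._message
  }
}
```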
Below is the code for the sh:maxCount constraint, shown here in slightly simplified form. The compileMaxCount function reads the object value and converts it to an integer. The validateMaxCountProperty function binds the given maxCount argument and returns the validation function.

```javascript
function compileMaxCount (shape) {
  // read the sh:maxCount object value and convert it to an integer
  const maxCount = parseInt(shape.ptr.out([sh.maxCount]).value, 10)

  return validateMaxCountProperty(maxCount)
}

function validateMaxCountProperty (maxCount) {
  // maxCount is bound; the returned validation function
  // doesn't need the shape context anymore
  return ({ values }) => values.length <= maxCount
}
```
The validation code just needs to go through all compiled validation functions. Nested shapes are compiled on demand. This approach adds almost zero overhead for one-time validations but gives a significant performance boost if the validation of a shape is called multiple times.
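The validation step and the on-demand compilation of nested shapes can be sketched as follows; again, the names are illustrative:

```javascript
// Run every compiled validation function against the given values
// (illustrative sketch, not the actual shacl-engine code).
function validate (validations, values) {
  return validations.every(validation => validation(values))
}

// Wrap a nested shape so compilation happens on first use only.
function lazyCompile (compileShape, nestedShape) {
  let compiled = null
  return values => {
    if (compiled === null) compiled = compileShape(nestedShape)
    return validate(compiled, values)
  }
}
```

The wrapper adds almost no cost if the nested shape is never reached, but reuses the compiled functions on every later call.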
Benchmark
The following picture shows the runtime required to validate the SHACL Shapes to Validate Shapes Graphs from the specification against themselves. Three different engines are included in this comparison:
- pySHACL: The rdflib-based Python implementation.
- rdf-validate-shacl: A fork of the JavaScript reference implementation that uses RDF/JS interfaces.
- shacl-engine: My new JavaScript engine.
shacl-engine was tested with different configurations:
- +c: Coverage enabled.
- +k: The validator instance was constructed only once and reused for all loops.
Each configuration was run 100 times.
Surprisingly, keeping the validator instance took more time. I’m not sure why, but modern VMs are a complex topic.
engine | mean (ms) | median (ms) | factor |
---|---|---|---|
pyshacl | 643.13 | 640.94 | 16.08 |
rdf-validate-shacl | 632.39 | 629.47 | 15.79 |
rdf-validate-shacl +k | 717.74 | 722.03 | 18.12 |
shacl-engine | 40.28 | 39.85 | 1.00 |
shacl-engine +c | 50.68 | 50.03 | 1.26 |
shacl-engine +c +k | 59.92 | 59.84 | 1.50 |
shacl-engine +k | 42.26 | 41.57 | 1.04 |
The table includes the factor relative to the shacl-engine run, which was the fastest of all configurations.
As you can see, the performance of pySHACL and rdf-validate-shacl is very similar, and shacl-engine is about 15x faster.
Coverage/fragments
Another problem, which some of you may have already stumbled over: when a SHACL target doesn’t match anything, the validation report still says everything is fine. For this use case, it would be helpful to know which triples were matched by the shape. The same information could also be used to create sub-graphs based on shapes, as an alternative to Concise Bounded Description (CBD) with more control.
shacl-engine implements such a feature: the triples matched by a specific shape are stored in the details of the report. That can be very useful for debugging.
There are many more details related to that feature worth talking about, like grapoi, the graph traversal library I wrote that keeps track of the traversed path, but that will be a topic for another blog post.
I got a hint on the SHACL Discord server that there is a fork of pySHACL that implements a similar feature, which they call shape fragments. The fork can be found here, and the authors also published a paper.
Playground
The playground application contains two text editors to copy and edit SHACL shapes and data in Turtle or JSON-LD format. The shapes and the report can be explored with a tree viewer. The coverage is shown in a text editor and visualized in a graph network component, which is quite helpful for debugging. It’s a quick and dirty application, and the UI has a lot of room for improvement, but its functionality improved my process of creating complex shapes.
Thanks
I want to thank Holger Knublauch, who implemented the reference implementation, and Martin Maillard for porting the reference implementation to RDF/JS. It was good to have the other implementations to test edge cases.
Comments
For comments, please follow the GitHub link.