RDF/JS for Data Processing

While working on RDF/JS specifications, I fulfilled two roles.

As the chair, my goal was to ensure that every feasible submitted requirement could be implemented based on what is written in the specification.

As the author and maintainer of RDF-Ext, I also contributed requirements to the group. In this blog post, I want to give you a closer look at one core idea I pushed forward - an idea that, until now, was spread over multiple discussions and comments in different GitHub issues.

Exchange vs. Processing

At the time we started working on the specification, people mainly struggled with the lack of interoperability between RDF objects from different sources, like parsers, and targets, like serializers. It was clear that one requirement was the ability to exchange RDF objects between different libraries.

RDF-Ext had already partially solved that problem by wrapping parsers and serializers in a translation layer. Although this layer costs performance and requires maintenance, it wasn’t a big concern for me, and I was also looking into problems that should be solved further down the line.

During the early stages, a specification for high-level interfaces was discussed. Since we noticed there are various opinions on that topic, we decided to focus on the low-level interfaces first. Once that was working, it was easier to experiment with higher-level APIs in the scope of libraries.

Processing RDF data in JavaScript was not easy. Usually, one would need a lot of match calls just to access data. The example below shows how to get the street property for a given term of a person:

const addressQuads = [...personDataset.match(personTerm, ns.schema.address, null)]
const streetQuads = [...personDataset.match(addressQuads[0].object, ns.schema.streetAddress, null)]
const street = streetQuads[0].object.value

SimpleRDF provided an ORM on top of RDF-Ext. Accessing the value from the previous example would be as simple as this:

const street = person.address.streetAddress

There is also an alternative approach: Grapoi is more graph-focused and requires a little more code for the street address example, but it’s also much more flexible. Here is how the same value can be accessed with Grapoi:

const street = person
  .out(ns.schema.address)
  .out(ns.schema.streetAddress).value

That means we need to agree on data structures for data exchange, make sure the existing data processing use cases still work, and we are done, right?

Advanced Data Processing

So far, we have only scratched the surface of data processing. We have all that nice graph data, and we can traverse it, yet we lack tools and algorithms to fully leverage its potential. Even simple tasks like finding the shortest path are not defined in the SPARQL specification, and most frameworks and libraries don’t provide them either. Another example would be scoring nodes with algorithms like PageRank.
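
Neither of these is hard to build on top of the low-level interfaces, though. Below is a minimal sketch of a breadth-first shortest-path search that relies only on the RDF/JS DatasetCore match method; findShortestPath and the way paths are represented are my own choices for this example, not part of any specification:

// Minimal sketch of a breadth-first shortest-path search over an RDF/JS dataset.
// findShortestPath is a hypothetical helper, not defined by the specification.
function findShortestPath (dataset, start, end) {
  const queue = [[start]]
  const visited = new Set([`${start.termType}:${start.value}`])

  while (queue.length > 0) {
    const path = queue.shift()
    const last = path[path.length - 1]

    if (last.equals(end)) {
      return path
    }

    // follow all outgoing edges of the last term in the current path
    for (const quad of dataset.match(last, null, null)) {
      const key = `${quad.object.termType}:${quad.object.value}`

      if (!visited.has(key)) {
        visited.add(key)
        queue.push([...path, quad.object])
      }
    }
  }

  return null
}

A PageRank-style scoring could be implemented in a similar way, iterating over match results and accumulating a score per term.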

Stretch the RDF Model

Implementing such algorithms may require intermediate states that would be invalid according to the RDF model. In the age of LLMs, let’s use the following example, where we have a triple structure, but all parts (subject, predicate, object) are literals. The MagicVectorLLMBox will translate the terms to the correct type. But initially, we would have a structure that looks like this in Turtle:

"Gregory House"
"is" "Head of Diagnostic Medicine";
"was born" "1959-05-15".

If the RDF/JS model were stricter, creating triples that match the data would not be possible. One would need to define custom data structures, which I would like to avoid. Any library that follows the RDF/JS specification can handle the following code:

const dataset = rdf.dataset([
  rdf.quad(rdf.literal('Gregory House'), rdf.literal('is'), rdf.literal('Head of Diagnostic Medicine')),
  rdf.quad(rdf.literal('Gregory House'), rdf.literal('was born'), rdf.literal('1959-05-15'))
])

Let’s take a look at another example: a server that accepts POST requests and creates resources with IRIs based on the content of the request. The server will look for the ex:name of an ex:Root class instance and use that value in the path. The server accepts Turtle with relative IRIs, which are rebased onto the created resource. A request from the client could look like this:

@prefix ex: <http://example.org/>.

<> a ex:Root;
  ex:name "root";
  ex:hasChild <a>.

<a>
  ex:name "child".

If relative IRIs are allowed, the content can simply be parsed as it is and processed further. If Named Nodes must be absolute IRIs, a temporary base IRI must be given to the parser, and once the resource IRI is known, all IRIs built on the temporary base must be rebased onto it. Let’s have a look at the N-Triples representation, which should be close to what most developers see while debugging their code.

Parsed with relative IRIs:

<> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Root>.
<> <http://example.org/hasChild> <a>.
<> <http://example.org/name> "root".
<a> <http://example.org/name> "child".

Parsed with http://localhost/ as a temporary base IRI:

<http://localhost/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Root>.
<http://localhost/> <http://example.org/hasChild> <http://localhost/a>.
<http://localhost/> <http://example.org/name> "root".
<http://localhost/a> <http://example.org/name> "child".

After rebasing:

<http://app.example.org/resource/root> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Root>.
<http://app.example.org/resource/root> <http://example.org/hasChild> <http://app.example.org/resource/root/a>.
<http://app.example.org/resource/root> <http://example.org/name> "root".
<http://app.example.org/resource/root/a> <http://example.org/name> "child".

Identifying the intermediate Named Nodes is much easier with relative IRIs. In terms of lines of code, both solutions are similar, but it is surely less confusing if a base IRI can be given at the point where it’s known, without hacks for intermediate processing.
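
The rebase step itself fits into a few lines of RDF/JS code. Here is a minimal sketch, where rebaseTerm and rebaseQuads are hypothetical helpers and details like the trailing slash of the root resource are glossed over:

// Minimal sketch: replace the temporary base IRI with the final resource IRI.
// rebaseTerm and rebaseQuads are hypothetical helpers.
function rebaseTerm (term, from, to) {
  if (term.termType === 'NamedNode' && term.value.startsWith(from)) {
    return rdf.namedNode(to + term.value.slice(from.length))
  }

  return term
}

function rebaseQuads (dataset, from, to) {
  return rdf.dataset([...dataset].map(quad => rdf.quad(
    rebaseTerm(quad.subject, from, to),
    rebaseTerm(quad.predicate, from, to),
    rebaseTerm(quad.object, from, to)
  )))
}

const rebased = rebaseQuads(parsed, 'http://localhost/', 'http://app.example.org/resource/root/')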

We had a related discussion about absolute vs. relative IRIs in Named Nodes. Some argued that the RDF model only allows absolute IRIs and that a factory must therefore validate the IRI when creating a Named Node object. But that could have triggered some follow-up problems.

Validation

Validation is a related topic, but it should be considered separately, as there is no single validation: it’s modular, and not everyone needs or even wants all validations. Let me show you why.

Absolute vs. Relative IRIs

Relative IRIs appear in serialized data, but the RDF model doesn’t allow them. Thus, they are always intermediate. If a format allows relative IRIs, they are expected to be rebased with an externally given base IRI, for example, the URL the data was fetched from, or a file URL if it was read from the local file system. As shown in the previous example, there are cases where that intermediate step contains logic one doesn’t want to have in the parser. But once the data is written to a triplestore or a file, one would expect all IRIs to be absolute.
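
For the common case, rebasing is just standard IRI resolution, which JavaScript already ships with in the WHATWG URL class:

// resolving a relative IRI against a base IRI
new URL('a', 'http://example.org/resource/').href // 'http://example.org/resource/a'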

Datatypes

Datatypes define constraints as well. A well-known one is xsd:dateTime, which is defined to have a value in the format YYYY-MM-DDThh:mm:ss with an optional timezone. In ETL data processing, one may have to deal with date values in non-standard formats. In a multistep data pipeline, the cleaning may come later, and in that case, it can be useful to keep the original data; enforcing validation in the persistence layer could become a problem. But it’s even more complicated: one kind of validation could be regular-expression-based, while a stricter validation may check the date against a calendar and detect the 29th of February in a non-leap year as invalid.
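
To illustrate the difference, here is a minimal sketch of both levels of validation for xsd:dateTime; the function names and the simplified pattern are assumptions for this example:

// Level 1: regular-expression-based check of the lexical form (simplified pattern)
const dateTimePattern = /^(\d{4})-(\d{2})-(\d{2})T\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})?$/

function matchesLexicalForm (literal) {
  return dateTimePattern.test(literal.value)
}

// Level 2: calendar-aware check that detects the 29th of February in a non-leap year
function isValidCalendarDate (literal) {
  const match = dateTimePattern.exec(literal.value)

  if (!match) {
    return false
  }

  const year = Number(match[1])
  const month = Number(match[2])
  const day = Number(match[3])
  // day 0 rolls back to the last day of the previous month,
  // which here is the 1-based input month
  const daysInMonth = new Date(Date.UTC(year, month, 0)).getUTCDate()

  return month >= 1 && month <= 12 && day >= 1 && day <= daysInMonth
}

matchesLexicalForm accepts "2023-02-29T00:00:00Z", while isValidCalendarDate rejects it.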

Shapes

So far, only validations that work on individual triples have been mentioned. Validation against an ontology is required for business logic, and this is where SHACL comes into play: it works across multiple triples in a data graph. While the previous validations could be implemented in the RDF/JS DataFactory interface, SHACL cannot.
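
With a library like rdf-validate-shacl, which implements SHACL on top of RDF/JS datasets, running such a validation could look like the following sketch (assuming the shapes and data datasets are already loaded):

import SHACLValidator from 'rdf-validate-shacl'

// shapes and data are RDF/JS datasets loaded elsewhere
const validator = new SHACLValidator(shapes)
const report = validator.validate(data)

if (!report.conforms) {
  for (const result of report.results) {
    console.log(result.message, result.focusNode)
  }
}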

Summary

The RDF/JS interfaces were defined very openly, allowing intermediate steps to be handled with the same data objects. That was a requirement for extending the usage of RDF/JS objects beyond data exchange to data processing, and it simplifies implementing algorithms. Stricter types aren’t a replacement for validation: depending on the previous processing steps, the data sources, like parsers or triplestores, and any additional validations, you get different guarantees on your RDF data. As there are different levels of validation for different needs, it is the developer’s duty to implement validations according to the needs of the application.