CONSTRUCT a DISTINCT set of triples by following paths in SPARQL

Question

I am trying to write a SPARQL query that extracts all relevant triples from a triplestore using CONSTRUCT. The triplestore contains a set of JSON-LD documents that were parsed into triples, so there is a predictable set of predicates and patterns, and my goal is to reconstruct one of these documents by retrieving its triples. The documents were JSON objects roughly 7 nested objects deep. The structure is generally known, but any leaf object may have unknown properties that I want to get back. One way to go about this is:

CONSTRUCT WHERE
{
  # get top level object
  ?subject <:knownProperty1> ?v1 .
  ?subject <:knownProperty2> ?v2 .
  ?subject <:knownProperty3> ?v3 .

  # leaf subobjects should get all their fields included
  ?v1 ?v1_p ?v1_o .
  ?v2 ?v2_p ?v2_o .
  ?v3 ?v3_p ?v3_o .

  # v3 has these nested objects.
  ?v3 <:knownNest1> ?n1 .
  ?n1 ?n1_p ?n1_o .

  # n2 is the next level of nesting
  ?n1 <:knownNest2> ?n2 .
  ?n2 ?n2_p ?n2_o .

  #... and so on
}

This produces a set of triples that is orders of magnitude larger than the actual document due to duplication. It is correct, but it creates "a graph" for every possible combinatorial match of these values, especially because each level of nesting may have multiple (an array of) subobjects. It gets hairier because many of these known fields are also optional. Each match assigns one concrete value per variable, so every match that includes ?subject <:knownProperty1> <:value1> supplies one copy of that triple, and it ends up being included hundreds to thousands of times. In the simple test case I am using to iterate, there are 106 triples in the input, and fully specifying the allowed structure as shown above yields a CONSTRUCT result of 5.5 MILLION triples with a query latency (in RAM) of over 60 seconds.
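As a rough way to see the blow-up, the same pattern can be run as a counting query. This is only a sketch: it reuses the placeholder IRIs from the query above and stops at the first nested level. Every solution row instantiates each triple of the template once, so the number of triples emitted before any de-duplication is roughly the row count times the number of triple patterns.

```sparql
# Count the joint solutions of the pattern instead of constructing triples.
SELECT (COUNT(*) AS ?solutions)
WHERE
{
  # top level object
  ?subject <:knownProperty1> ?v1 .
  ?subject <:knownProperty2> ?v2 .
  ?subject <:knownProperty3> ?v3 .

  # all fields of the leaf subobjects
  ?v1 ?v1_p ?v1_o .
  ?v2 ?v2_p ?v2_o .
  ?v3 ?v3_p ?v3_o .

  # first nested level
  ?v3 <:knownNest1> ?n1 .
  ?n1 ?n1_p ?n1_o .
}
```

For instance, if v1, v2, v3 and n1 were each a single node with 5 properties, one subject would already give 5 × 5 × 5 × 5 = 625 rows, although only a couple of dozen distinct triples are involved.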

I can handle writing a complex query, but I believe this is a code smell given that the basic problem is not that complicated. So my questions are:

Am I thinking about this wrong? Is it in fact quite hard in SPARQL to write a query that retrieves all the triples following certain paths?
Is there a convenient way to use SELECT DISTINCT subqueries to shorten this? All my attempts at this are equivalent to "select each distinct comprehensive match on this pattern", which is no better. I want distinct triples when the pattern matches are combined.

Any other suggestions about the proper way to approach this are welcome. Thank you!

Benjamin Hofstetter · Accepted Answer

I use the following pattern and process to write CONSTRUCT queries like this.

Start with a SELECT and a UNION for each level.
- To be safe, don't reuse variable names in different UNION blocks.

SELECT * WHERE
{ 
  {
        # get top level object

  } UNION {
        # leaf subobjects should get all their fields included

  } UNION {
        # v3 has these nested objects.

  } UNION {
        # n2 is the next level of nesting

  } UNION {
        #... and so on

  }
}
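As an illustration, the first two blocks might be filled in with the patterns from the question like this. It is a sketch under assumptions: the `<:knownProperty…>` IRIs are the question's placeholders, the variable names (`?s1`, `?top_p`, `?leaf`, and so on) are made up for this example, and `VALUES` is used only as shorthand for listing the known predicates; it is not something the pattern requires.

```sparql
SELECT * WHERE
{
  {
    # Block 1: the known top-level properties of the object
    VALUES ?top_p { <:knownProperty1> <:knownProperty2> <:knownProperty3> }
    ?s1 ?top_p ?top_o .
  } UNION {
    # Block 2: every field of the leaf subobjects reached via a known property
    VALUES ?link { <:knownProperty1> <:knownProperty2> <:knownProperty3> }
    ?s2 ?link ?leaf .
    ?leaf ?leaf_p ?leaf_o .
  }
}
```

Each result row comes from exactly one block, so it binds only that block's variables and leaves the others unbound. The number of rows is therefore the sum of the block sizes rather than their product.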

Now you can run the query and verify the output. If all is OK, replace the 'SELECT *' with your CONSTRUCT template. Imagine that your CONSTRUCT template gets called once for every row of the result table from your SELECT query.

CONSTRUCT {
  # get top level object triples
  # .... use the variables from UNION block 1

  # leaf subobjects should get all their fields included
  # .... use the variables from UNION block 2

  # v3 has these nested objects.
  # .... use the variables from UNION block 3

  # n2 is the next level of nesting
  # .... use the variables from UNION block 4

  # and so on ....

}
WHERE
{ 
  {
        # get top level object

  } UNION {
        # leaf subobjects should get all their fields included

  } UNION {
        # v3 has these nested objects.

  } UNION {
        # n2 is the next level of nesting

  } UNION {
        #... and so on

  }
}
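Filled in with the patterns from the question, the whole query might look like the sketch below. The IRIs are the question's placeholders and all variable names are hypothetical, chosen so that no name is shared between UNION blocks; `VALUES` is just a compact way to list the known predicates. The sketch relies on standard CONSTRUCT behaviour: a template triple that contains an unbound variable is skipped for that row, so a row from block 3 produces only the `?n1` triple.

```sparql
CONSTRUCT {
  # from block 1: top level object triples
  ?s1 ?top_p ?top_o .
  # from block 2: all fields of the leaf subobjects
  ?leaf ?leaf_p ?leaf_o .
  # from block 3: all fields of n1
  ?n1 ?n1_p ?n1_o .
  # from block 4: all fields of n2
  ?n2 ?n2_p ?n2_o .
}
WHERE
{
  {
    # Block 1: get top level object
    VALUES ?top_p { <:knownProperty1> <:knownProperty2> <:knownProperty3> }
    ?s1 ?top_p ?top_o .
  } UNION {
    # Block 2: leaf subobjects should get all their fields included
    VALUES ?link { <:knownProperty1> <:knownProperty2> <:knownProperty3> }
    ?s2 ?link ?leaf .
    ?leaf ?leaf_p ?leaf_o .
  } UNION {
    # Block 3: v3 has these nested objects
    ?s3 <:knownProperty3> ?v3_b3 .
    ?v3_b3 <:knownNest1> ?n1 .
    ?n1 ?n1_p ?n1_o .
  } UNION {
    # Block 4: n2 is the next level of nesting
    ?s4 <:knownProperty3> ?v3_b4 .
    ?v3_b4 <:knownNest1> ?n1_b4 .
    ?n1_b4 <:knownNest2> ?n2 .
    ?n2 ?n2_p ?n2_o .
  }
}
```

Blocks 3 and 4 each walk the path from the top-level object again, which is the price of keeping the blocks independent.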

This approach works well for me.

I avoid unwanted 'duplicates' because I don't mix variables from different UNION blocks.
By starting with a SELECT, I can focus on fetching the needed triples, and in the CONSTRUCT part I can focus on building the graph.
Separating different concerns into different UNION blocks helps me debug.
I can optimise the query performance of a single UNION block if needed.

Cons: This approach sometimes leads to 'repeat yourself' in the different UNION blocks.
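One possible way to trim some of that repetition, offered as a suggestion rather than as part of the approach above, is a SPARQL 1.1 sequence property path (`/`), which walks several known predicates in one triple pattern. The sketch below covers only the two nested levels; the IRIs are the question's placeholders and the variable names are hypothetical.

```sparql
CONSTRUCT {
  ?n1 ?n1_p ?n1_o .
  ?n2 ?n2_p ?n2_o .
}
WHERE
{
  {
    # all fields of n1, reached via knownProperty3 then knownNest1
    ?s3 <:knownProperty3>/<:knownNest1> ?n1 .
    ?n1 ?n1_p ?n1_o .
  } UNION {
    # all fields of n2, one step further down the same path
    ?s4 <:knownProperty3>/<:knownNest1>/<:knownNest2> ?n2 .
    ?n2 ?n2_p ?n2_o .
  }
}
```

The intermediate nodes are no longer bound to variables, so this only fits blocks that do not need them in the CONSTRUCT template.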

Tags:

rdf

graph-databases

json-ld

sparql

triplestore
