Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sparql Keys vs distinct values

I have a sparql query that returns duplicates, and I want it to clean them up on one of the values only (subjectID). Unlike DISTINCT that seems to find a unique value for the combination of values selected, rather than for only one of the parameters. I saw someone here propose group by, but that only seems applicable if I list all the parameters after group by (my sparql endpoint complains, e.g. Non-group key variable in SELECT: ?occupation). I tried running an internal select, but it doesn't seem to work for this specific query. So might be an issue with the query itself ( the values of the livedIn optional seem to be causing the duplicate) ?

While happy enough with relational DBs early in the learning curve with SPARQL, so feel free to explain the obvious for the otherwise uninitiated! :)

select distinct  
  ?subjectID ?englishName ?sex ?locatedIn15Name 
  ?dob ?dod ?dom ?bornLocationName ?occupation 
  where { 
      ?person a hc:Person ;
      hc:englishName ?englishName ;
      hc:sex ?sex; 
      hc:subjectID ?subjectID; 
       optional { ?person hc:livedIn11 ?livedIn11 .
           ?livedIn11 hc:englishName ?lived11LocationName . 
           ?livedIn11 hc:locatedIn11 ?locatedIn11 .
           ?locatedIn11 hc:englishName ?locatedIn11Name .
           ?locatedIn11 hc:locatedIn15 ?locatedIn15 .
 ?locatedIn15 hc:englishName ?locatedIn15Name .
 } . 
       optional {?person hc:born ?dob } .
       optional {?person hc:dateOfDeath ?dod } .
       optional {?person hc:dateOfMarriage ?dom } .
       optional { ?person hc:bornIn ?bornIn . 
       ?bornIn hc:englishName ?bornLocationName . 
             ?bornIn hc:easting ?easting . 
             ?bornIn hc:northing ?northing } .
       optional {  ?person hc:occupation ?occupation } 
       FILTER regex(?englishName, "^FirstName LastName")
      } 
  GROUP BY 
  ?subjectID ?englishName  ?sex 
   ?locatedIn15Name ?dob ?dod ?dom 
  ?bornLocationName ?occupation 
like image 712
Nava Avatar asked Jul 10 '12 18:07

Nava


2 Answers

Re the error message:

Non-group key variable in SELECT: ?occupation

You can avoid this by using the SAMPLE() aggregate - this will allow you to just group on ?subjectID but still select values for the rest of the variables provided you only care about getting one value for those other variables.

Here's a simple example of this:

SELECT ?subjectID (SAMPLE(?dob) AS ?dateOfBirth)
WHERE
{
  ?person a hc:Person ;
          hc:subjectID ?subjectID .
  OPTIONAL { ?person hc:born ?dob }
}
GROUP BY ?subjectID
like image 65
RobV Avatar answered Nov 19 '22 18:11

RobV


First thing to note is that there is no such thing as a key, really, in RDF/SPARQL. You're querying a graph, and ?subjectID may simply have several possible combinations of values for the other variables you are selecting. This is caused by the shape of the graph you're querying: perhaps your person has more than one english name, or indeed the other way around: the same english name can be shared by more than one person.

A SPARQL SELECT query is a strange beast: it queries a graph structure but presents the result as a flat table (technically, it's a sequence of sets of variable bindings, but it amounts to the same thing). Duplicates occur because different combinations of values for your variables can be found by basically following different paths in the graph.

The fact that you get duplicate values for ?subjectID in your result is therefore unavoidable, simply because these are, from the point of view of the RDF graph, unique solutions to your query. You can not filter out results without actually losing information, so in general it's hard to give you a solution without knowing more about exactly which 'duplicates' you want to discard: do you only want one possible english name for each subject, or one possible date of birth (even though there may be more than one in your data)?

However, here are some tips for handling/procesing such results more easily:

First of all, you could choose to use an ORDER BY clause on your ?subjectID variable. This will still give you several rows with the same value for ?subjectID, but they'll all be in order, so you can process your result more efficiently.

Another solution is to split your query in two: do a first query that only selects all unique subjects (and possibly all other values for which you know, in advance, that they will be unique given the subject), then iterate over the result and do a separate query to get the other values you're interested in, for each individual subjectID value. This solution may sound like heresy (especially if you're from an SQL background), but it might actually be quicker and easier than trying to do everything in one huge query.

Yet another solution is the one suggested by RobV: using a SAMPLE aggregate on a particular variable to just select one (random) unique value. A variation on that is to use the GROUP_CONCAT aggregate, which creates a single value by concatenating all possible values into a single string.

like image 37
Jeen Broekstra Avatar answered Nov 19 '22 18:11

Jeen Broekstra