
Exclude results from DBpedia SPARQL query based on URI prefix

Tags: sparql, dbpedia

How can I exclude a group of concepts when using the DBpedia SPARQL endpoint? I'm using the following basic query to get a list of concepts:

SELECT DISTINCT ?concept
WHERE {
    ?x a ?concept
}
LIMIT 100


This gives me a list of 100 concepts. I want to exclude all the concepts that fall into the YAGO class/group (i.e., whose IRIs begin with http://dbpedia.org/class/yago/). I can filter out individual concepts like this:

SELECT DISTINCT ?concept
WHERE {
    ?x a ?concept
    FILTER (?concept != <http://dbpedia.org/class/yago/1950sScienceFictionFilms>)
}
LIMIT 100


What I can't work out is how to exclude all YAGO classes from my results. I tried using a * wildcard like this, but it had no effect:

FILTER (?concept != <http://dbpedia.org/class/yago/*>)

Update:

This query with regex seems to do the trick, but it's really, really slow and ugly. I'm really looking forward to a better alternative.

SELECT DISTINCT ?type WHERE {
  [] a ?type
  FILTER( regex(str(?type), "^(?!http://dbpedia.org/class/yago/).+"))
}
ORDER BY ASC(?type)
LIMIT 10
Mohammad Amir asked Sep 27 '13 07:09


1 Answer

It might seem a little awkward, but your comment about casting to a string and doing some string-based checks is probably on the right track. You can do it a little bit more efficiently using the SPARQL 1.1 function strstarts:

SELECT DISTINCT ?concept
WHERE {
    ?x a ?concept
    FILTER ( !strstarts(str(?concept), "http://dbpedia.org/class/yago/") )
}
LIMIT 100
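As a rough illustration of why the two filters are interchangeable, here is a minimal Python sketch (not SPARQL) that applies the negative-lookahead pattern from the question and a plain prefix check to a few hand-written IRIs. The sample list is made up for the example, and Python's re module is only a stand-in for whatever regex engine the endpoint uses, so treat this as a sanity check of the logic rather than of endpoint behaviour.

```python
import re

YAGO_PREFIX = "http://dbpedia.org/class/yago/"

# Hand-written sample IRIs for illustration only; these are not query results.
sample_iris = [
    "http://dbpedia.org/class/yago/1950sScienceFictionFilms",
    "http://dbpedia.org/ontology/Film",
    "http://www.w3.org/2002/07/owl#Thing",
    "http://dbpedia.org/class/yago/PhysicalEntity100001930",
]

# Same pattern as in the question: match strings that do NOT start with the prefix.
negative_lookahead = re.compile(r"^(?!http://dbpedia.org/class/yago/).+")

# Analogue of FILTER(regex(str(?type), ...)).
kept_by_regex = [iri for iri in sample_iris if negative_lookahead.search(iri)]

# Analogue of FILTER(!strstarts(str(?concept), ...)).
kept_by_prefix = [iri for iri in sample_iris if not iri.startswith(YAGO_PREFIX)]

print(kept_by_regex == kept_by_prefix)  # True
```

Both lists keep only the two non-YAGO IRIs, which is the behaviour the strstarts filter expresses more directly.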


The other alternative would be to find a top level YAGO class, and to exclude those concepts that are rdfs:subClassOf that top level class. This would probably be a better solution in the long run (since it doesn't require casting to strings, and it's based on graph structure). Unfortunately, it doesn't look like there is a single top level YAGO class comparable to owl:Thing. I just downloaded the YAGO type hierarchy from DBpedia's download page and ran this query, which asks for classes with no superclasses, against it:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?root where {
  [] rdfs:subClassOf ?root 
  filter not exists { ?root rdfs:subClassOf ?superRoot }
}

and I got these nine results:

----------------------------------------------------------------
| root                                                         |
================================================================
| <http://dbpedia.org/class/yago/YagoLegalActorGeo>            |
| <http://dbpedia.org/class/yago/WaterNymph109550125>          |
| <http://dbpedia.org/class/yago/PhysicalEntity100001930>      |
| <http://dbpedia.org/class/yago/Abstraction100002137>         |
| <http://dbpedia.org/class/yago/YagoIdentifier>               |
| <http://dbpedia.org/class/yago/YagoLiteral>                  |
| <http://dbpedia.org/class/yago/YagoPermanentlyLocatedEntity> |
| <http://dbpedia.org/class/yago/Thing104424418>               |
| <http://dbpedia.org/class/yago/Dryad109551040>               |
----------------------------------------------------------------
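The logic of the root-finding query can be sketched in Python over a toy edge list: a root is any class that appears as a superclass but never as a subclass. The class names and the find_roots helper below are hypothetical and chosen only for illustration; they are not the real YAGO hierarchy.

```python
# Toy (subclass, superclass) pairs; hypothetical names, not the real YAGO data.
sub_class_of = [
    ("Cat", "Mammal"),
    ("Dog", "Mammal"),
    ("Mammal", "PhysicalEntity"),
    ("Idea", "Abstraction"),
]


def find_roots(edges):
    """Return classes that are a superclass of something but have no superclass."""
    superclasses = {sup for _, sup in edges}  # everything matched by ?root
    has_parent = {sub for sub, _ in edges}    # everything with some ?superRoot
    return sorted(superclasses - has_parent)


roots = find_roots(sub_class_of)
print(roots)  # ['Abstraction', 'PhysicalEntity']
```

The set difference plays the role of the filter not exists clause: Mammal is dropped because it has a superclass of its own.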

Given that the YAGO concepts aren't quite as structured as some of the others, it looks like the string-based approach may be the best in this case. However, if you wanted to, you could do a non-string-based query like this, which asks for 100 concepts, excluding those which have one of those nine results as a superclass:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix yago: <http://dbpedia.org/class/yago/>

select distinct ?concept where {
  [] a ?concept .
  filter not exists {
    ?concept rdfs:subClassOf* ?super .
    values ?super { 
      yago:YagoLegalActorGeo
      yago:WaterNymph109550125
      yago:PhysicalEntity100001930
      yago:Abstraction100002137
      yago:YagoIdentifier
      yago:YagoLiteral
      yago:YagoPermanentlyLocatedEntity
      yago:Thing104424418
      yago:Dryad109551040
    }
  }
}
limit 100
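To make the rdfs:subClassOf* part concrete, here is a minimal Python sketch of the same idea: walk upward from each concept through its superclasses and drop it if the walk reaches an excluded root. Because the path allows zero steps, a root itself is also excluded. The parents mapping, the class names, and the has_excluded_ancestor helper are all hypothetical.

```python
def has_excluded_ancestor(concept, parents, excluded_roots):
    """True if concept, or any class reachable via superclass links, is excluded."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in excluded_roots:
            return True
        if current in seen:
            continue  # guard against cycles in the hierarchy
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return False


# Toy hierarchy: class -> direct superclasses (hypothetical names).
parents = {
    "Cat": ["Mammal"],
    "Mammal": ["PhysicalEntity"],
    "Film": ["Work"],
}
excluded_roots = {"PhysicalEntity"}

concepts = ["Cat", "Film", "PhysicalEntity", "Work"]
kept = [c for c in concepts if not has_excluded_ancestor(c, parents, excluded_roots)]
print(kept)  # ['Film', 'Work']
```

Note that the walk can visit many classes per concept, so the cost depends on the depth of the hierarchy as well as on the number of roots.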


I'm not sure which ends up being faster. The first requires a conversion to string, and the strstarts, if implemented in a naïve fashion, has to consume http://dbpedia.org/class/ in each concept before something is a mismatch. The second requires nine comparisons that, if IRIs are interned, are just object identity checks. It's an interesting question for further investigation.
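The cost of the naïve prefix check can be illustrated with a small Python sketch that counts how many leading characters match before the first mismatch. This models a character-by-character comparison only; it says nothing about how a real SPARQL engine implements strstarts, and the matched_prefix_length helper is hypothetical.

```python
YAGO_PREFIX = "http://dbpedia.org/class/yago/"


def matched_prefix_length(text, prefix):
    """Count leading characters of text that agree with prefix."""
    count = 0
    for a, b in zip(text, prefix):
        if a != b:
            break
        count += 1
    return count


# A non-DBpedia IRI diverges right after the scheme.
print(matched_prefix_length("http://www.w3.org/2002/07/owl#Thing", YAGO_PREFIX))  # 7

# A DBpedia ontology class shares the host before diverging.
print(matched_prefix_length("http://dbpedia.org/ontology/Film", YAGO_PREFIX))  # 19

# A YAGO class matches the whole 30-character prefix.
print(matched_prefix_length("http://dbpedia.org/class/yago/Dryad109551040", YAGO_PREFIX))  # 30
```

IRIs that share more of the namespace with the YAGO prefix cost more characters to reject, which is the overhead described above.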

Joshua Taylor answered Sep 21 '22 09:09