
How to select random DBPedia nodes from SPARQL?

How can I select a random sample from DBpedia using the SPARQL endpoint?

This query

SELECT ?s WHERE { ?s ?p ?o . FILTER ( 1 > bif:rnd (10, ?s, ?p, ?o) ) } LIMIT 10

(found here) seems to work OK on most SPARQL endpoints, but on http://dbpedia.org/sparql the result gets cached, so it always returns the same 10 nodes.

If I try it from Jena, I get the following exception:

Unresolved prefixed name: bif:rnd

And I can't find out what the 'bif' namespace is.

Any idea on how to solve this?

Mulone asked Apr 15 '11 13:04


4 Answers

In SPARQL 1.1 you can do:

SELECT ?s
WHERE {
  ?s ?p ?o
}
ORDER BY RAND()
LIMIT 10

I don't know offhand how many stores will optimise this, or even implement it yet, though.

[see comment below, this doesn't quite work]

An alternative is:

SELECT (SAMPLE(?s) AS ?ss)
WHERE { ?s ?p ?o }
GROUP BY ?s

But I'd think that's even less likely to be optimised.

Steve Harris answered Oct 25 '22 13:10


bif:rnd is not part of the SPARQL standard, so it is not portable across SPARQL endpoints. You can use LIMIT, ORDER BY and OFFSET to simulate a random sample with a standard query. Something like:

SELECT * WHERE { ?s ?p ?o } 
ORDER BY ?s OFFSET $some_random_number$ LIMIT 10

Here some_random_number is a number generated by your application. This should avoid the caching problem, but the query is still quite expensive, and I don't know whether public endpoints will support it.
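For illustration, here is a minimal Python sketch of the application side of this approach. It only builds the query text and a SPARQL Protocol GET URL; it does not contact the endpoint. The function names and the max_offset parameter are hypothetical: max_offset is an upper bound you have to pick yourself, and it must stay below the number of rows the pattern matches, otherwise the query returns fewer rows or none.

```python
import random
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named in the question


def random_offset_query(max_offset, limit=10, rng=random):
    """Build the ORDER BY / OFFSET / LIMIT query with an application-chosen offset.

    max_offset is an assumed upper bound (exclusive) supplied by the caller.
    """
    offset = rng.randrange(max_offset)  # 0 <= offset < max_offset
    return (
        "SELECT * WHERE { ?s ?p ?o } "
        f"ORDER BY ?s OFFSET {offset} LIMIT {limit}"
    )


def request_url(query, endpoint=ENDPOINT):
    """URL for a SPARQL Protocol GET request; the query goes in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = random_offset_query(max_offset=100_000)
    print(query)
    print(request_url(query))
```

Note that this returns 10 consecutive rows starting at a random position in the ordering, not 10 independently chosen rows, so it is only a rough approximation of a random sample.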

Try to avoid completely open patterns like ?s ?p ?o and your query will be much more efficient.

Manuel Salvadores answered Oct 25 '22 15:10


bif:rnd is a Virtuoso-specific extension and will thus only work against Virtuoso SPARQL endpoints.

bif is the prefix for Virtuoso Built In Functions which enable any Virtuoso function to be called in SPARQL, with rnd being a Virtuoso function for returning random numbers.

hwilliams answered Oct 25 '22 13:10


I encountered the same problem and none of the solutions here addressed my issue. Here is my solution; it was non-trivial and quite a hack. This works for DBPedia as of now, and may work for other SPARQL endpoints, but it is not guaranteed to work for future releases.

DBPedia uses Virtuoso, which supports an undocumented argument to the RAND function; the argument effectively specifies the range to use for the PRNG. The game is to trick Virtuoso into believing that the input argument cannot be statically-evaluated before each result row is computed, forcing the program to evaluate RAND() for every binding:

select * {
    ?s dbo:isPartOf ?o .  # Whatever your pattern is
    bind(rand(1 + strlen(str(?s))*0) as ?rid)
} order by ?rid

The magic happens in rand(1 + strlen(str(?s))*0), which generates the equivalent of rand() but forces it to run on every match, exploiting the fact that the program cannot predict the value of an expression that involves a variable (in this case, we just compute the length of the IRI as a string). The actual expression is not important: we multiply it by 0 to discard it completely, then add 1 so that rand executes normally.

This only works because the developers did not go this far in their static-code-evaluation of expressions. They could have easily written a branch for "multiply by zero", but alas they did not :)
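To reuse the trick with other patterns, a small helper can wrap any graph pattern in the same bind/order by construction. The Python sketch below is only string templating around the query shown above. The function name and its parameters are hypothetical, and the trailing limit clause is an addition that is not in the original query, included so that only a sample of the shuffled rows comes back. The dbo: prefix is assumed to be predefined by the endpoint, as it is in the query above.

```python
def virtuoso_random_sample_query(pattern, var="s", limit=10):
    """Wrap a graph pattern in the per-row rand() trick and cap the result size.

    pattern: SPARQL graph pattern text that binds ?<var>.
    var:     name of a variable bound by the pattern (without the '?').
    limit:   number of rows to keep after shuffling (not in the original query).
    """
    return (
        "select * {\n"
        f"    {pattern}\n"
        f"    bind(rand(1 + strlen(str(?{var}))*0) as ?rid)\n"
        f"}} order by ?rid limit {limit}"
    )


if __name__ == "__main__":
    print(virtuoso_random_sample_query("?s dbo:isPartOf ?o ."))
```

As with the original query, this relies on Virtuoso-specific behaviour and is not expected to be portable to other SPARQL engines.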

Blake Regalia answered Oct 25 '22 13:10