
Wikidata results sorted by something similar to a PageRank

In Wikidata (Wikidata SPARQL endpoint), is there a way to order the SPARQL query results with something like a PageRank?

SELECT DISTINCT ?entity ?entityLabel WHERE {
    ?entity wdt:P31 wd:Q5.
    SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" .
    }
} LIMIT 100 OFFSET 0

Can we specify a field to order the results by, such that the entity at the top is more notable/important/recognizable than the one after it, and so on?

asked Sep 11 '16 at 16:09 by jordipala


2 Answers

In case this question is still of interest: there is indeed a Wikidata PageRank project (no affiliation with the Wikimedia Foundation). It is hosted at

https://github.com/athalhammer/danker

and you can compute PageRank with Wikidata Q-IDs for any available Wikipedia language (or even for the union of the links of all language versions). The project owner also runs the computation at irregular intervals; the resulting scores are hosted at:

https://danker.s3.amazonaws.com/index.html

The output of the computation can then be converted to N-Triples/Turtle (first) and from there to HDT (second).
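The first conversion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the score file holds one "Q-ID<TAB>score" pair per line, and the rank predicate URI is an assumed placeholder that should be replaced with whatever vocabulary your setup uses. The function name is hypothetical.

```python
# Assumption: the Wikidata entity prefix is the standard one; the rank
# predicate below is an assumed placeholder and can be swapped out.
WD_ENTITY = "http://www.wikidata.org/entity/"
RANK_PREDICATE = "http://purl.org/voc/vrank#pagerank"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def scores_to_ntriples(lines, predicate=RANK_PREDICATE):
    """Yield one N-Triples line per 'Q-ID<TAB>score' input line."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        qid, score = line.split("\t")
        value = float(score)  # fail early on malformed scores
        yield f'<{WD_ENTITY}{qid}> <{predicate}> "{value}"^^<{XSD_DOUBLE}> .'


# Usage with two scores taken from the top-10 list in this thread.
for triple in scores_to_ntriples(["Q729\t24996.770", "Q30\t24772.450"]):
    print(triple)
```

The resulting file can then be loaded into a triple store or passed to an HDT conversion tool.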

Option 1: From an endpoint hosting this Wikidata PageRank HDT file, one can run federated queries against the live Wikidata endpoint (examples are provided in the linked repository).

Option 2: Use the created Wikidata PageRank HDT file together with the latest HDT dump of Wikidata and combine them with HDTCat.

Option 3: Don't use HDT and just load the N-Triples/Turtle file into a triple store of your choice together with the Wikidata dump N-Triples/Turtle files.

[Image: example federated query]

answered Sep 30 '22 at 20:09 by thalhamm


It seems that PageRank does not make much sense in relation to Wikidata. Obviously, large classes and large aggregates will be leaders.

Also, unlike web links, RDF predicates are "navigable" from both sides; this is just a matter of design, which URI is a subject and which URI is an object.
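Both points can be illustrated with a toy example. The following is a minimal power-iteration PageRank sketch, not the method used by any particular project; the graph, node names, and parameter defaults are made up for illustration. Modelling the same facts as "instance of" links puts the class on top, while the inverse "has instance" direction puts it at the bottom.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes)
                    for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Toy data: three items pointing at a class via an "instance of"-like link.
forward = [("Q1", "Class"), ("Q2", "Class"), ("Q3", "Class")]
# The same facts modelled in the opposite direction ("has instance").
reverse = [(dst, src) for src, dst in forward]

forward_rank = pagerank(forward)
reverse_rank = pagerank(reverse)
print(max(forward_rank, key=forward_rank.get))  # highest-ranked node, forward
print(min(reverse_rank, key=reverse_rank.get))  # lowest-ranked node, reverse
```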

However, Andreas Thalhammer continues his work on this. The top 10 Wikidata entities by PageRank are:

Item     Label        Rank
Q729     animal       24996.770
Q30      USA          24772.450
Q1360    Arthropoda   16930.883
Q1390    insects      16531.822
Q35409   family       14403.091
Q756     plant        14019.927
Q142     France       13723.484
Q34740   genus        13718.484
Q16      Canada       12321.178
Q159     Russia       11707.160

Unfortunately, Wikidata PageRank scores are not published on the Wikidata endpoint itself, so one cannot query them there directly using SPARQL.


Fortunately, one can figure out some kind of a rank oneself. Possible options are:

  1. Number of outgoing statements (precalculated; ?outcoming in the query below);
  2. Number of sitelinks (precalculated);
  3. Number of incoming statements (in the example below, only truthy statements are counted).

Example query:

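SELECT ?item ?itemLabel ?outcoming ?sitelinks ?incoming {
    # Items that are members (P463) of the European Union (Q458)
    ?item wdt:P463 wd:Q458 .
    # Precalculated counts stored with each item
    ?item wikibase:statements ?outcoming .
    ?item wikibase:sitelinks ?sitelinks .
    {
        # Count truthy statements that point at each item
        SELECT (count(?s) AS ?incoming) ?item WHERE {
            ?item wdt:P463 wd:Q458 .
            ?s ?p ?item .
            # Restrict ?p to direct (wdt:) predicates
            [] wikibase:directClaim ?p
        } GROUP BY ?item
    }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} ORDER BY DESC(?incoming)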
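If the result of this query is fetched in the standard SPARQL JSON results format, the ordering can also be switched between the three metrics on the client side without re-running the query. A minimal sketch follows; the function name is hypothetical, fetching is left out, and the sample data uses placeholder URIs and made-up counts rather than real Wikidata values.

```python
def sort_bindings(results, metric, descending=True):
    """Return (item URI, label, metric value) tuples sorted by one numeric variable."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append((
            binding["item"]["value"],
            binding.get("itemLabel", {}).get("value", ""),
            int(binding[metric]["value"]),
        ))
    return sorted(rows, key=lambda row: row[2], reverse=descending)


def literal(value):
    """Build a literal term as it appears in SPARQL JSON results."""
    return {"type": "literal", "value": str(value)}


# Hand-made sample in the SPARQL JSON results shape (all values are made up).
sample = {
    "head": {"vars": ["item", "itemLabel", "outcoming", "sitelinks", "incoming"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://example.org/entity/A"},
         "itemLabel": literal("A"), "outcoming": literal(10),
         "sitelinks": literal(300), "incoming": literal(5)},
        {"item": {"type": "uri", "value": "http://example.org/entity/B"},
         "itemLabel": literal("B"), "outcoming": literal(40),
         "sitelinks": literal(100), "incoming": literal(50)},
    ]},
}

for row in sort_bindings(sample, "sitelinks"):
    print(row)
```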


As of October 2017, all these metrics are more or less correlated.

[Image: scatterplot matrix]

Below are the correlation coefficients of these measures for the EU members.

Pearson    outcoming  sitelinks  incoming  pagerank
outcoming  1.0000     0.6907     0.7416    0.8652
sitelinks  0.6907     1.0000     0.4314    0.5717
incoming   0.7416     0.4314     1.0000    0.8978
pagerank   0.8652     0.5717     0.8978    1.0000

Spearman   outcoming  sitelinks  incoming  pagerank
outcoming  1.0000     0.6869     0.7619    0.8736
sitelinks  0.6869     1.0000     0.7680    0.8342
incoming   0.7619     0.7680     1.0000    0.8872
pagerank   0.8736     0.8342     0.8872    1.0000

Kendall    outcoming  sitelinks  incoming  pagerank
outcoming  1.0000     0.4914     0.5661    0.7143
sitelinks  0.4914     1.0000     0.5764    0.6454
incoming   0.5661     0.5764     1.0000    0.7249
pagerank   0.7143     0.6454     0.7249    1.0000
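
For readers who want to reproduce such coefficients on their own data, here is a dependency-free sketch of the three measures. It is not necessarily how the numbers above were computed; in particular, the tau-b variant for Kendall is an assumption. The sample vectors are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau_b(xs, ys):
    """Kendall tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from all counts
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x) * (untied + ties_y))


# Made-up sample vectors, e.g. two metrics measured for five items.
a = [1, 2, 3, 4, 5]
b = [2, 1, 4, 3, 5]
print(pearson(a, b), spearman(a, b), kendall_tau_b(a, b))
```

Spearman and Kendall work on ranks only, so they are less sensitive to a few extreme values than Pearson, which may explain part of the gap between the Pearson and Spearman figures for sitelinks versus incoming.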

See also:

  • https://phabricator.wikimedia.org/T143424
  • https://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API#PageRank
  • https://phabricator.wikimedia.org/T162279

answered Sep 30 '22 at 18:09 by Stanislav Kralin