In Wikidata (Wikidata SPARQL endpoint), is there a way to order the SPARQL query results with something like a PageRank?
SELECT DISTINCT ?entity ?entityLabel WHERE {
  ?entity wdt:P31 wd:Q5 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
} LIMIT 100 OFFSET 0
Can we specify a field to order the results by, such that the entity at the top is more notable/important/recognizable than the one after it, and so on?
In case this question is still of interest: there is indeed a Wikidata PageRank project (not affiliated with the Wikimedia Foundation). It is hosted at
https://github.com/athalhammer/danker
and you can use it to compute PageRank over Wikidata Q-IDs for any available Wikipedia language (or even for the union of the links of all language versions). The project owner also runs the computation from time to time, on no fixed schedule, and hosts the resulting scores at:
https://danker.s3.amazonaws.com/index.html
The output of the computation can then be converted to N-Triples/Turtle (first) and from there to HDT (second).
Option 1: From an endpoint hosting this Wikidata PageRank HDT file, one can run federated queries against the live Wikidata endpoint (examples are provided in the linked repository).
Option 2: Use the created Wikidata PageRank HDT file together with the latest HDT dump of Wikidata and combine the two with HDTCat.
Option 3: Don't use HDT and just load the N-Triples/Turtle file into a triple store of your choice together with the Wikidata dump N-Triples/Turtle files.
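The first conversion step (scores to N-Triples) can be sketched in a few lines of Python. This is only a sketch under assumptions that the text above does not confirm: that the score file has one `Q-ID<TAB>score` pair per line, and that the scores are attached with the vRank predicate `http://purl.org/voc/vrank#pagerank`. Check both against the repository before relying on it. The function name `tsv_to_ntriples` is made up for this example.

```python
import io

# Assumptions (not confirmed above): input lines look like "Q30<TAB>24772.450",
# and scores are attached to entities with the vRank pagerank predicate.
WD_ENTITY = "http://www.wikidata.org/entity/"
VRANK_PAGERANK = "http://purl.org/voc/vrank#pagerank"
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"


def tsv_to_ntriples(lines):
    """Yield one N-Triples line per 'Q-ID<TAB>score' input line."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        qid, score = line.split("\t")
        float(score)  # raises ValueError if the score is malformed
        yield (
            f"<{WD_ENTITY}{qid}> <{VRANK_PAGERANK}> "
            f'"{score}"^^<{XSD_FLOAT}> .'
        )


# Two rows taken from the top-10 table in this thread.
sample = io.StringIO("Q729\t24996.770\nQ30\t24772.450\n")
triples = list(tsv_to_ntriples(sample))
for triple in triples:
    print(triple)
```

The resulting file can then be loaded into a triple store directly (Option 3) or converted to HDT with the usual HDT tooling.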
PageRank does not seem to make much sense for Wikidata: large classes and large aggregates will obviously end up at the top.
Also, unlike web links, RDF predicates are "navigable" from both sides; which URI is the subject and which is the object is just a matter of design.
Nevertheless, Andreas Thalhammer continues his work on it. The top 10 Wikidata entities by PageRank are:
Item | Label | Rank |
---|---|---|
Q729 | animal | 24996.770 |
Q30 | USA | 24772.450 |
Q1360 | Arthropoda | 16930.883 |
Q1390 | insects | 16531.822 |
Q35409 | family | 14403.091 |
Q756 | plant | 14019.927 |
Q142 | France | 13723.484 |
Q34740 | genus | 13718.484 |
Q16 | Canada | 12321.178 |
Q159 | Russia | 11707.160 |
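Unfortunately, Wikidata PageRank scores are not published on the Wikidata endpoint itself, so one can't query them there using SPARQL.
Fortunately, one can work out some kind of rank oneself. Possible options, all used in the example query, are the number of statements on the item (wikibase:statements, ?outcoming), the number of sitelinks (wikibase:sitelinks, ?sitelinks), and the number of incoming links from other entities (?incoming).
Example query (for the members of the European Union):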
SELECT ?item ?itemLabel ?outcoming ?sitelinks ?incoming {
  ?item wdt:P463 wd:Q458 .
  ?item wikibase:statements ?outcoming .
  ?item wikibase:sitelinks ?sitelinks .
  {
    SELECT (count(?s) AS ?incoming) ?item WHERE {
      ?item wdt:P463 wd:Q458 .
      ?s ?p ?item .
      [] wikibase:directClaim ?p
    } GROUP BY ?item
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} ORDER BY DESC(?incoming)
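If you would rather run such a query from a script than from the web interface, something like the following should work. It is a minimal sketch using only the Python standard library against the public endpoint at https://query.wikidata.org/sparql. The query is a cut-down variant of the one above that orders by sitelinks only; the helper names `build_request` and `run_query` and the User-Agent string are placeholders made up for this example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Simplified variant of the query above: rank EU members by sitelink count.
QUERY = """
SELECT ?item ?itemLabel ?sitelinks WHERE {
  ?item wdt:P463 wd:Q458 .
  ?item wikibase:sitelinks ?sitelinks .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} ORDER BY DESC(?sitelinks)
"""


def build_request(query, user_agent="rank-example/0.1 (placeholder contact)"):
    """Build a GET request asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={
            "User-Agent": user_agent,  # replace with your own identifier
            "Accept": "application/sparql-results+json",
        },
    )


def run_query(query):
    """Send the query and return (label, sitelinks) pairs in result order."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        data = json.load(response)
    return [
        (row["itemLabel"]["value"], int(row["sitelinks"]["value"]))
        for row in data["results"]["bindings"]
    ]
```

Calling `run_query(QUERY)` performs the network request and returns the rows already sorted by the endpoint, so the first pair is the item with the most sitelinks.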
As of October 2017, all these metrics are more or less correlated with each other and with PageRank.
The tables below give the Pearson, Spearman and Kendall correlation coefficients of these measures for the EU members.
Pearson | outcoming | sitelinks | incoming | pagerank |
---|---|---|---|---|
outcoming | 1.0000 | 0.6907 | 0.7416 | 0.8652 |
sitelinks | 0.6907 | 1.0000 | 0.4314 | 0.5717 |
incoming | 0.7416 | 0.4314 | 1.0000 | 0.8978 |
pagerank | 0.8652 | 0.5717 | 0.8978 | 1.0000 |
Spearman | outcoming | sitelinks | incoming | pagerank |
---|---|---|---|---|
outcoming | 1.0000 | 0.6869 | 0.7619 | 0.8736 |
sitelinks | 0.6869 | 1.0000 | 0.7680 | 0.8342 |
incoming | 0.7619 | 0.7680 | 1.0000 | 0.8872 |
pagerank | 0.8736 | 0.8342 | 0.8872 | 1.0000 |
Kendall | outcoming | sitelinks | incoming | pagerank |
---|---|---|---|---|
outcoming | 1.0000 | 0.4914 | 0.5661 | 0.7143 |
sitelinks | 0.4914 | 1.0000 | 0.5764 | 0.6454 |
incoming | 0.5661 | 0.5764 | 1.0000 | 0.7249 |
pagerank | 0.7143 | 0.6454 | 0.7249 | 1.0000 |
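For anyone wanting to reproduce this kind of comparison, here is a rough pure-Python sketch of the Pearson and Spearman coefficients (Kendall is left out to keep it short). The input numbers are made up for illustration and are not the EU data. They are chosen so that one outlier lowers Pearson while the rank-based Spearman stays at 1, which is the kind of gap visible between the sitelinks/incoming entries in the Pearson and Spearman tables above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up values for illustration only (not real Wikidata counts).
sitelinks = [10, 20, 30, 40, 1000]
incoming = [1, 2, 3, 4, 5]

print("Pearson: ", round(pearson(sitelinks, incoming), 3))
print("Spearman:", round(spearman(sitelinks, incoming), 3))
```

Both sequences are strictly increasing, so their ranks agree and Spearman is 1, while the single large value pulls Pearson well below that.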