Extracting RDF triples from Wikidata

I'm following this guide on querying from Wikidata.

I can get a certain entity (if I know its code) using:

from wikidata.client import Client
client = Client()
entity = client.get('Q20145', load=True)
entity
# <wikidata.entity.Entity Q20145 'IU'>
entity.description
# m'South Korean singer-songwriter, record producer, and actress'

But how can I get the RDF triples of that entity? That is, all the outgoing and incoming edges in the form of (subject, predicate, object).

Looks like this SO question managed to get the triples, but only from a data dump here. I'm trying to get it from the library itself.

Penguin asked Sep 10 '21 19:09


2 Answers

If you only need the outgoing edges, you can retrieve them directly from https://www.wikidata.org/wiki/Special:EntityData/Q20145.nt

from rdflib import Graph
g = Graph()
g.parse('https://www.wikidata.org/wiki/Special:EntityData/Q20145.nt', format="nt")    
for subj, pred, obj in g:
    print(subj, pred, obj)

To get both the incoming and the outgoing edges, you need to query the database. On Wikidata, this is done using the Wikidata Query Service and the query language SPARQL. The SPARQL expression to get all edges is as simple as DESCRIBE wd:Q20145.

With Python, you can retrieve the results of the query with the following code:

import requests

endpoint_url = "https://query.wikidata.org/sparql"
headers = { 'User-Agent': 'MyBot' }
payload = {
    'query': 'DESCRIBE wd:Q20145',
    'format': 'json'
}
r = requests.get(endpoint_url, params=payload, headers=headers)
results = r.json()

triples = []
for result in results["results"]["bindings"]:   
    triples.append((result["subject"], result["predicate"], result["object"]))
print(triples)

This gives you the full result, which reflects the complex underlying data model. To query the incoming and outgoing edges separately, replace DESCRIBE wd:Q20145 with either CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?s) ?s ?p ?o} to get only the outgoing edges, or CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?o) ?s ?p ?o} to get only the incoming edges.
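As a rough sketch, the three query variants can be generated from one small helper and assigned to the 'query' entry of the payload dictionary in the code above. The function name build_edge_query and the direction labels "both", "outgoing" and "incoming" are made up for this illustration; they are not part of requests, Wikidata or any library.

```python
import re

# Hypothetical helper (the name and the direction labels are illustrative only).
# Entity IDs are assumed to look like "Q" followed by digits, e.g. "Q20145".
_QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def build_edge_query(qid, direction="both"):
    """Return the SPARQL query string for the requested edge direction."""
    # Reject anything that is not a plain entity ID, so that arbitrary text
    # never gets pasted into the query.
    if not _QID_PATTERN.fullmatch(qid):
        raise ValueError(f"not a Wikidata entity ID: {qid!r}")
    if direction == "both":
        return f"DESCRIBE wd:{qid}"
    if direction == "outgoing":
        # The entity is bound to the subject position.
        return f"CONSTRUCT {{?s ?p ?o}} WHERE {{BIND(wd:{qid} AS ?s) ?s ?p ?o}}"
    if direction == "incoming":
        # The entity is bound to the object position.
        return f"CONSTRUCT {{?s ?p ?o}} WHERE {{BIND(wd:{qid} AS ?o) ?s ?p ?o}}"
    raise ValueError(f"unknown direction: {direction!r}")


print(build_edge_query("Q20145", "outgoing"))
```

With this in place, the payload could be built as {'query': build_edge_query('Q20145', 'incoming'), 'format': 'json'} and the rest of the request code stays the same.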

Depending on your goal, you may want to filter out some triples, e.g. statement triples, and you may want to simplify others. One way to get a cleaner result is to replace the last four lines with:

triples = []
for result in results["results"]["bindings"]:
    # Strip the common URI prefixes so only the IDs (e.g. Q20145, P31) remain
    subject = result["subject"]["value"].replace('http://www.wikidata.org/entity/', '')
    obj = result["object"]["value"].replace('http://www.wikidata.org/entity/', '')  # 'obj' avoids shadowing the built-in 'object'
    predicate = result["predicate"]["value"].replace('http://www.wikidata.org/prop/direct/', '')
    # Skip triples that involve statement nodes
    if 'statement/' in subject or 'statement/' in obj:
        continue
    triples.append((subject, predicate, obj))
print(triples)
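If you ran the DESCRIBE query and still want the two directions apart, the simplified triples can also be split locally instead of sending two CONSTRUCT queries. The following is only a sketch: split_edges is a hypothetical helper name, and the sample list is hand-written in the shape produced by the loop above, not real query output.

```python
def split_edges(triples, qid):
    """Partition (subject, predicate, object) tuples relative to one entity.

    A triple whose subject is the entity counts as outgoing; a triple whose
    object is the entity (and whose subject is something else) counts as
    incoming. Triples that do not touch the entity at all are dropped.
    """
    outgoing = [t for t in triples if t[0] == qid]
    incoming = [t for t in triples if t[2] == qid and t[0] != qid]
    return outgoing, incoming


# Hand-written sample data for illustration only (not fetched from Wikidata).
sample = [
    ("Q20145", "P31", "Q5"),         # entity in subject position
    ("Q999999", "P175", "Q20145"),   # entity in object position
    ("Q999999", "P31", "Q5"),        # unrelated to the entity
]

outgoing, incoming = split_edges(sample, "Q20145")
print("outgoing:", outgoing)
print("incoming:", incoming)
```

Splitting locally saves a second round trip to the query service, at the cost of downloading the full DESCRIBE result first.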
Pascalco answered Oct 16 '22 19:10


But how can I get the RDF triples of that entity?

By using a SPARQL DESCRIBE query (source), you get a single RDF graph as the result, containing all the outgoing and incoming edges in the form of (subject, predicate, object). This can be achieved using the following Python example code (source):

from SPARQLWrapper import SPARQLWrapper, JSON

queryString = """DESCRIBE wd:Q20145"""
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

sparql.setQuery(queryString)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results["results"]["bindings"]:
    print(result)

If you want only the outgoing edges, use CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?s) ?s ?p ?o}; for only the incoming edges, use CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?o) ?s ?p ?o} (thanks to @UninformedUser).

Example code:

from SPARQLWrapper import SPARQLWrapper, JSON

queryString = """CONSTRUCT {?s ?p ?o} WHERE {BIND(wd:Q20145 AS ?s) ?s ?p ?o}"""
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

sparql.setQuery(queryString)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for result in results["results"]["bindings"]:
    print(result)

The result with DESCRIBE and CONSTRUCT can be seen here and here respectively.

R. Marolahy answered Oct 16 '22 20:10