I am writing a Python script to extract entity names from a collection of thousands of news articles from a few countries and languages.
I would like to make use of the excellent DBpedia structured knowledge, for example to look up the names of "artists in Egypt" and names of "companies in Canada".
(If this information were in SQL form, I would have no problem.)
I would prefer to download the DBpedia content and use it offline. Any ideas on what is needed to do so, and on how to query it locally from Python?
DBpedia content is in RDF format. The dumps can be downloaded from here.
DBpedia is a large dataset in RDF, and handling that amount of data requires triple-store technology. For DBpedia you will need a native triple store; I recommend either Virtuoso or 4store. I personally prefer 4store.
Once you have your triple store set up with DBpedia loaded into it, you can use SPARQL to query the DBpedia RDF triples. There are Python libraries that can help you with that, but 4store and Virtuoso can return results as JSON, so you can easily get by without any libraries.
A simple urllib script (Python 2) like ...

import sys
import traceback
import urllib
import urllib2

def query(q, epr, f='application/json'):
    """Send SPARQL query q to endpoint epr and return the raw response body."""
    try:
        params = urllib.urlencode({'query': q})
        opener = urllib2.build_opener(urllib2.HTTPHandler)
        request = urllib2.Request(epr + '?' + params)
        request.add_header('Accept', f)  # ask the endpoint for JSON results
        request.get_method = lambda: 'GET'
        url = opener.open(request)
        return url.read()
    except Exception, e:
        traceback.print_exc(file=sys.stdout)
        raise e
can help you run SPARQL queries ... for instance
>>> q1 = """
... select ?birthPlace where {
... <http://dbpedia.org/resource/Claude_Monet> <http://dbpedia.org/property/birthPlace> ?birthPlace .
... }"""
>>> print query(q1,"http://dbpedia.org/sparql")
{ "head": { "link": [], "vars": ["birthPlace"] },
"results": { "distinct": false, "ordered": true, "bindings": [
{ "birthPlace": { "type": "literal", "xml:lang": "en", "value": "Paris, France" }} ] } }
>>>
I hope this gives you an idea of how to start.
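For the original use case (e.g. names of artists in Egypt), one common approach is to query by Wikipedia category membership via dcterms:subject. A minimal sketch that just builds the SPARQL string; note that the category name "Egyptian_artists" is an assumption here, so check the actual category names on DBpedia before relying on it:

```python
def category_members_query(category):
    """Build a SPARQL query listing entities in a given Wikipedia category.

    The category argument (e.g. "Egyptian_artists") is assumed to match an
    existing DBpedia Category: resource -- verify this against DBpedia.
    """
    return """
    SELECT ?entity ?name WHERE {
      ?entity <http://purl.org/dc/terms/subject>
              <http://dbpedia.org/resource/Category:%s> .
      ?entity <http://www.w3.org/2000/01/rdf-schema#label> ?name .
      FILTER (lang(?name) = "en")
    }""" % category

q = category_members_query("Egyptian_artists")
# Pass q to the query() helper above, pointed at your local endpoint, e.g.:
# results = query(q, "http://localhost:8890/sparql")  # default Virtuoso port (assumption)
```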
In Python 3 the answer will look like this, using the requests library:

import sys
import requests

def query(q, epr, f='application/json'):
    """Send SPARQL query q to endpoint epr and return the raw response body."""
    try:
        resp = requests.get(epr, params={'query': q}, headers={'Accept': f})
        return resp.text
    except Exception as e:
        print(e, file=sys.stdout)
        raise
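Whichever version you use, the endpoint returns results in the standard SPARQL JSON results format, like the Monet example above. A small helper can pull the plain values out of the bindings; shown here against the sample result from the text:

```python
import json

def binding_values(result_text, var):
    """Extract the plain values bound to a variable from a SPARQL JSON result string."""
    data = json.loads(result_text)
    return [b[var]['value'] for b in data['results']['bindings'] if var in b]

# Using the Monet result shown above:
sample = '''{ "head": { "link": [], "vars": ["birthPlace"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
  { "birthPlace": { "type": "literal", "xml:lang": "en", "value": "Paris, France" }} ] } }'''

print(binding_values(sample, 'birthPlace'))  # ['Paris, France']
```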