I'm trying to write a Python program that can search Wikipedia for the birth and death dates of people.
For example, Albert Einstein was born: 14 March 1879; died: 18 April 1955.
I started with the question "Fetch a Wikipedia article with Python":
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
page2 = infile.read()
This works as far as it goes: page2 is the XML representation of the lead section (section 0) of Albert Einstein's Wikipedia page.
Now that I have the page in XML format, I looked at this tutorial: http://www.travisglines.com/web-coding/python-xml-parser-tutorial. However, I don't understand how to get the information I want (the birth and death dates) out of the XML. I feel like I must be close, and yet I have no idea how to proceed from here.
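For readers on Python 3, where urllib2 no longer exists, the fetch above might look roughly like the sketch below. It uses only the standard library and the same query parameters as the URL in the question. The function names build_query_url and fetch_lead_section are made up for illustration, and the switch to https is an assumption on my part.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the same revisions query as in the question for any page title."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,  # section 0 is the lead section, which holds the infobox
        "titles": title,
        "format": "xml",
    }
    return API_URL + "?" + urlencode(params)


def fetch_lead_section(title):
    """Return the raw API response (XML text) for the page's lead section."""
    request = Request(build_query_url(title), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling fetch_lead_section("Albert_Einstein") should give the same kind of XML string that page2 holds in the Python 2 version.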
EDIT
After a few responses, I've installed BeautifulSoup. I'm now at the stage where I can print:
import BeautifulSoup as BS
soup = BS.BeautifulSoup(page2)
print soup.getText()
{{Infobox scientist
| name = Albert Einstein
| image = Einstein 1921 portrait2.jpg
| caption = Albert Einstein in 1921
| birth_date = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
| spouse = [[Mileva Marić]]&nbsp;(1903–1919)<br>{{nowrap|[[Elsa Löwenthal]]&nbsp;(1919–1936)}}
| residence = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* [[Austria–Hungary|Austria]] (1911–1912)
* [[German Empire|Germany]] (1914–1933)
* United States (1940–1955)
}}
So, much closer, but I still don't know how to return the death_date in this format, unless I start parsing things with re. I can do that, but I feel like I'd be using the wrong tool for this job.
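For what it's worth, the re route on the infobox text shown above could look something like this. It is only a rough sketch: it assumes the dates appear as {{Birth date|...}} and {{Death date and age|...}} templates with year, month and day as the first numeric arguments, which holds for the Einstein infobox but certainly not for every article. The name infobox_dates is hypothetical.

```python
import re
from datetime import date

# Matches e.g. "{{Birth date|df=yes|1879|3|14}}" and
# "{{Death date and age|df=yes|1955|4|18|1879|3|14}}".
DATE_TEMPLATE = re.compile(
    r"\{\{\s*(birth|death) date[^|}]*\|([^}]*)\}\}", re.IGNORECASE
)


def infobox_dates(wikitext):
    """Return a dict like {'birth': date, 'death': date} for the templates found."""
    found = {}
    for kind, args in DATE_TEMPLATE.findall(wikitext):
        # Keep only the purely numeric positional arguments (skips df=yes etc.).
        numbers = [int(a) for a in args.split("|") if a.strip().isdigit()]
        if len(numbers) >= 3:
            year, month, day = numbers[:3]
            # Only the first template of each kind is kept.
            found.setdefault(kind.lower(), date(year, month, day))
    return found


sample = """| birth_date = {{Birth date|df=yes|1879|3|14}}
| death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}"""
print(infobox_dates(sample))
```

This breaks as soon as an article writes the date as free text or nests other templates inside the date template, which is a good part of why regexes feel like the wrong tool here.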
I came across this question and appreciated all the useful information that was provided in @Yoshiki's answer, but it took some synthesizing to get to a working solution. Sharing here in case it's useful for anyone else. The code is also in this gist for those who wish to fork / improve it.
In particular, there's not much in the way of error handling here ...
from datetime import datetime
import json

import requests
from dateutil import parser


def id_for_page(page):
    """Uses the wikipedia api to find the wikidata id for a page"""
    api = "https://en.wikipedia.org/w/api.php"
    query = "?action=query&prop=pageprops&titles=%s&format=json"
    # The page title is the last path component of the article URL.
    slug = page.split('/')[-1]
    response = json.loads(requests.get(api + query % slug).content)
    # Assume we got 1 page result and it is correct.
    page_info = list(response['query']['pages'].values())[0]
    return page_info['pageprops']['wikibase_item']


def lifespan_for_id(wikidata_id):
    """Uses the wikidata API to retrieve wikidata for the given id."""
    data_url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json"
    page = json.loads(requests.get(data_url % wikidata_id).content)
    claims = list(page['entities'].values())[0]['claims']
    # P569 (birth) and P570 (death) ... not everyone has died yet.
    return [get_claim_as_time(claims, cid) for cid in ['P569', 'P570']]


def get_claim_as_time(claims, claim_id):
    """Helper function to work with data returned from wikidata api"""
    try:
        claim = claims[claim_id][0]['mainsnak']['datavalue']
        assert claim['type'] == 'time', "Expecting time data type"
        # dateparser chokes on leading '+', thanks wikidata.
        return parser.parse(claim['value']['time'][1:])
    except KeyError as e:
        # Missing claim (e.g. no death date): report it and return None.
        print(e)
        return None


def main():
    page = 'https://en.wikipedia.org/wiki/Albert_Einstein'
    # 1. use the wikipedia api to find the wikidata id for this page
    wikidata_id = id_for_page(page)
    # 2. use the wikidata id to get the birth and death dates
    span = lifespan_for_id(wikidata_id)
    for label, dt in zip(["birth", "death"], span):
        # dt is None when the claim is missing, so don't format it blindly.
        print(label, " = ", datetime.strftime(dt, "%b %d, %Y") if dt else "unknown")


if __name__ == "__main__":
    main()
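One piece of the missing error handling concerns dates that Wikidata only knows to the year or month. As far as I understand the Wikidata JSON, the time value comes with a precision field (11 for day, 10 for month, 9 for year), and coarser dates often carry 00 for the unknown month or day, which dateutil will not parse. The sketch below is one way to cope, assuming that falling back to the 1st of the month or year is acceptable for your use. The name wikidata_time_to_date is mine and not part of any API.

```python
import re
from datetime import date

TIME_PATTERN = re.compile(r"^([+-])(\d+)-(\d{2})-(\d{2})T")


def wikidata_time_to_date(value):
    """Convert a Wikidata time value dict to a date, or None if not possible.

    `value` is the claim's datavalue['value'] dict, for example
    {'time': '+1879-03-14T00:00:00Z', 'precision': 11}.
    """
    match = TIME_PATTERN.match(value["time"])
    if match is None or match.group(1) == "-":
        return None  # unparseable, or a BCE year that datetime.date cannot hold
    year, month, day = (int(match.group(i)) for i in (2, 3, 4))
    precision = value.get("precision", 11)
    # Anything coarser than a year (decade, century, ...) is not a usable date.
    if precision < 9 or not 1 <= year <= 9999:
        return None
    # Assumption: substitute 1 for a month or day that is unknown.
    if precision < 10 or month == 0:
        month = 1
    if precision < 11 or day == 0:
        day = 1
    return date(year, month, day)
```

In get_claim_as_time this could stand in for the parser.parse call by passing claim['value'] instead of the sliced time string. Note that it returns a date rather than a datetime.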