
Parse birth and death dates from Wikipedia?

I'm trying to write a Python program that can search Wikipedia for people's birth and death dates.

For example, Albert Einstein was born: 14 March 1879; died: 18 April 1955.

I started with "Fetch a Wikipedia article with Python":

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
page2 = infile.read()

This works as far as it goes: page2 is the XML representation of the first section (section 0) of Albert Einstein's Wikipedia page.

Now that I have the page in XML format, I looked at this tutorial: http://www.travisglines.com/web-coding/python-xml-parser-tutorial. However, I don't understand how to get the information I want (birth and death dates) out of the XML. I feel like I must be close, yet I have no idea how to proceed from here.

EDIT

After a few responses, I've installed BeautifulSoup. I'm now at the stage where I can print:

import BeautifulSoup as BS
soup = BS.BeautifulSoup(page2)
print soup.getText()
{{Infobox scientist
| name        = Albert Einstein
| image       = Einstein 1921 portrait2.jpg
| caption     = Albert Einstein in 1921
| birth_date  = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
| spouse      = [[Mileva Marić]] (1903–1919)<br>{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
| residence   = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* [[Austria–Hungary|Austria]] (1911–1912)
* [[German Empire|Germany]] (1914–1933)
* United States (1940–1955)
}}

So, much closer, but I still don't know how to return the death_date in the format shown above (e.g. 18 April 1955). Should I start parsing things with re? I can do that, but I feel like I'd be using the wrong tool for this job.

JBWhitmore asked Sep 03 '12 15:09



1 Answer

I came across this question and appreciated all the useful information that was provided in @Yoshiki's answer, but it took some synthesizing to get to a working solution. Sharing here in case it's useful for anyone else. The code is also in this gist for those who wish to fork / improve it.

In particular, there's not much in the way of error handling here ...

from datetime import datetime
import json

from dateutil import parser
import requests


def id_for_page(page):
    """Uses the wikipedia api to find the wikidata id for a page"""
    api = "https://en.wikipedia.org/w/api.php"
    query = "?action=query&prop=pageprops&titles=%s&format=json"
    slug = page.split('/')[-1]

    response = json.loads(requests.get(api + query % slug).content)
    # Assume we got 1 page result and it is correct.
    page_info = list(response['query']['pages'].values())[0]
    return page_info['pageprops']['wikibase_item']


def lifespan_for_id(wikidata_id):
    """Uses the Wikidata API to retrieve [birth, death] dates for the given id."""
    data_url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json"
    page = json.loads(requests.get(data_url % wikidata_id).content)

    claims = list(page['entities'].values())[0]['claims']
    # P569 (birth) and P570 (death) ... not everyone has died yet.
    return [get_claim_as_time(claims, cid) for cid in ['P569', 'P570']]


def get_claim_as_time(claims, claim_id):
    """Helper function to work with data returned from wikidata api"""
    try:
        claim = claims[claim_id][0]['mainsnak']['datavalue']
        assert claim['type'] == 'time', "Expecting time data type"

        # dateutil's parser chokes on the leading '+' in Wikidata timestamps.
        return parser.parse(claim['value']['time'][1:])
    except KeyError as e:
        print(e)
        return None


def main():
    page = 'https://en.wikipedia.org/wiki/Albert_Einstein'

    # 1. use the wikipedia api to find the wikidata id for this page
    wikidata_id = id_for_page(page)

    # 2. use the wikidata id to get the birth and death dates
    span = lifespan_for_id(wikidata_id)

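    for label, dt in zip(["birth", "death"], span):
        # dt is None when the claim is missing (e.g. the person is still alive).
        text = datetime.strftime(dt, "%b %d, %Y") if dt else "unknown"
        print(label, " = ", text)


if __name__ == '__main__':
    main()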
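
Since error handling is thin above, here is a rough sketch of the date-extraction step on its own, using only the standard library and no network access. SAMPLE_CLAIMS is a hypothetical, trimmed-down stand-in for the real 'claims' mapping, and claim_to_date and describe are illustrative names, not part of any API. It assumes a four-digit CE year with a leading '+', and it assumes that imprecise dates may carry 00 for the month or day, which it clamps to 1.

```python
from datetime import date

# Hypothetical, trimmed-down sample of the 'claims' mapping; real Wikidata
# responses carry many more fields.
SAMPLE_CLAIMS = {
    "P569": [{"mainsnak": {"datavalue": {
        "type": "time",
        "value": {"time": "+1879-03-14T00:00:00Z"},
    }}}],
    "P570": [{"mainsnak": {"datavalue": {
        "type": "time",
        "value": {"time": "+1955-04-18T00:00:00Z"},
    }}}],
}


def claim_to_date(claims, claim_id):
    """Return a date for the first claim with this id, or None if unavailable."""
    try:
        value = claims[claim_id][0]["mainsnak"]["datavalue"]["value"]
    except (KeyError, IndexError):
        return None  # claim missing, or it has no concrete value
    stamp = value["time"]
    if not stamp.startswith("+"):
        return None  # assumption: BCE dates are out of scope for this sketch
    # Take 'YYYY-MM-DD' out of '+YYYY-MM-DDT00:00:00Z'.
    year, month, day = (int(part) for part in stamp[1:11].split("-"))
    # Assumption: month/day may be 0 for imprecise dates; clamp them to 1.
    return date(year, max(month, 1), max(day, 1))


def describe(claims):
    """Format the known dates like 'born: 14 March 1879; died: 18 April 1955'."""
    parts = []
    for label, claim_id in (("born", "P569"), ("died", "P570")):
        d = claim_to_date(claims, claim_id)
        if d is not None:
            parts.append(f"{label}: {d.day} {d:%B} {d.year}")
    return "; ".join(parts)


print(describe(SAMPLE_CLAIMS))
```

For a living person the P570 entry is simply absent, so describe returns only the "born" part instead of raising.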
Jason Sundram answered Oct 01 '22 03:10
