
How to scrape data from multiple Wikipedia pages with Python?

I want to grab the age, place of birth, and previous occupation of senators. Information on each senator is available on their individual Wikipedia page, and another page has a table that lists all senators by name. How can I go through that list, follow the link to each senator's page, and grab the information I want?

Here is what I've done so far.

1. (No Python) I found out that DBpedia exists and wrote a query to search for senators. Unfortunately, DBpedia hasn't categorized most (if any) of them:

 SELECT ?senator ?country WHERE {
   ?senator rdf:type <http://dbpedia.org/ontology/Senator> .
   ?senator <http://dbpedia.org/ontology/nationality> ?country
 }

Query results are unsatisfactory.

2. I found out that there is a Python module called wikipedia that lets me search for and retrieve information from individual wiki pages. I used it to get a list of senator names from the table by looking at the hyperlinks.

import wikipedia as w
w.set_lang('pt')

# Grab page with table of senator names.
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])

# Get links to senator pages by filtering out links of no interest.
# Senator names don't contain digits or commas,
# and full names always contain spaces.
senators = [name for name in s.links
            if not any(char.isdigit() or char == ',' for char in name)
            and ' ' in name]

At this point I'm a bit lost. The list senators contains all senator names, but also other names, e.g., party names. The wikipedia module (at least from what I could find in the API documentation) also doesn't provide a way to follow links or search through tables.
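To see why this filter lets party names through, here is a minimal sketch of the same rule written as a standalone function. The function name looks_like_full_name and the sample titles are made up for illustration; they are not taken from the actual page.

```python
def looks_like_full_name(title):
    """Apply the same rule as the list comprehension above:
    no digits, no commas, and at least one space."""
    has_digit_or_comma = any(ch.isdigit() or ch == ',' for ch in title)
    return not has_digit_or_comma and ' ' in title


# Hypothetical link titles, chosen only to illustrate the rule.
samples = [
    'Maria da Silva',             # a person: kept
    'Partido dos Trabalhadores',  # a party: also kept, which is the problem
    '2016',                       # contains digits: dropped
    'Brasil',                     # no space: dropped
    'Rio de Janeiro, Brasil',     # contains a comma: dropped
]

kept = [title for title in samples if looks_like_full_name(title)]
print(kept)
```

Any multi-word title without digits or commas passes, so the rule alone probably cannot separate people from parties; the position of the link in the table seems like a more reliable signal.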

I've seen two related questions here on Stack Overflow that seem helpful, but both of them (here and here) extract information from a single page.

Can anyone point me towards a solution?

Thanks!

dangom asked Sep 02 '16 19:09



1 Answer

OK, so I figured it out (thanks to a comment pointing me to BeautifulSoup).

There is actually no big secret to achieving what I wanted. I just had to go through the list with BeautifulSoup and store all the links, then open each stored link with urllib2 and call BeautifulSoup on the response, and... done. Here is the solution:

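# urllib2 exists only in Python 2; fall back to urllib.request on Python 3.
try:
    import urllib2 as url
except ImportError:
    import urllib.request as url
import wikipedia as w
from bs4 import BeautifulSoup as bs
import re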

# A dictionary to store the data we'll retrieve.
d = {}

# 1. Grab the list from Wikipedia.
w.set_lang('pt')
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
html = url.urlopen(s.url).read()
soup = bs(html, 'html.parser')


# 2. Names and links are in the second column of the second table.
table2 = soup.find_all('table')[1]
for row in table2.find_all('tr'):
    for colnum, col in enumerate(row.find_all('td')):
        if (colnum + 1) % 5 == 2:
            a = col.find('a')
            # Skip cells that don't contain a link.
            if a is None:
                continue
            link = 'https://pt.wikipedia.org' + a.get('href')
            d[a.get('title')] = {'link': link}


# 3. Now that we have the links, we can iterate through them
# and grab the info from the tables on each page.
for senator, data in d.items():
    page = bs(url.urlopen(data['link']).read(), 'html.parser')
    # Flatten the cells of the first three tables into a single list
    # (flatten list trick: [a for b in nested for a in b]).
    rows = [cell for table in page.find_all('table')[0:3]
            for cell in table.find_all('td')]
    for rownumber, row in enumerate(rows):
        # Each value sits in the cell right after its label.
        if row.get_text() == 'Nascimento':
            birthinfo = rows[rownumber + 1].get_text().split('\n')
            try:
                d[senator]['birthplace'] = birthinfo[1]
            except IndexError:
                d[senator]['birthplace'] = ''
            birth = re.search(r'(.*\d{4}).*\((\d{2}).*\)', birthinfo[0])
            # Skip pages whose birth line does not match the pattern.
            if birth:
                d[senator]['birthdate'] = birth.group(1)
                d[senator]['age'] = birth.group(2)
        if row.get_text() == 'Partido':
            d[senator]['party'] = rows[rownumber + 1].get_text()
        if 'Profiss' in row.get_text():
            d[senator]['profession'] = rows[rownumber + 1].get_text()

Pretty simple. BeautifulSoup works wonders =)
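The birth-date regular expression is probably the most fragile part of the code above, so it can help to try it on a plain string before running the full scrape. The sketch below assumes the table cell reads something like the sample string; that sample and the helper name parse_birth are invented for illustration, and real pages may be formatted differently.

```python
import re

# Same pattern as in the solution: everything up to a four-digit year,
# then a two-digit age inside parentheses.
BIRTH_RE = re.compile(r'(.*\d{4}).*\((\d{2}).*\)')


def parse_birth(cell_text):
    """Split a cell's text into birthdate, age and birthplace.

    Returns None when the first line does not match the pattern.
    """
    lines = cell_text.split('\n')
    match = BIRTH_RE.search(lines[0])
    if match is None:
        return None
    return {
        'birthdate': match.group(1),
        'age': match.group(2),
        # The birthplace, when present, is on the second line.
        'birthplace': lines[1] if len(lines) > 1 else '',
    }


# Hypothetical cell text, not copied from a real page.
sample = '1 de janeiro de 1950 (66 anos)\nSao Paulo, SP'
print(parse_birth(sample))
print(parse_birth('data desconhecida'))
```

Returning None for a non-matching line makes it easy to spot pages that need a different pattern. Note that the two-digit age group would not match anyone aged 100 or more.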

dangom answered Oct 21 '22 00:10
