Scraping Wikipedia tables with Python selectively

I am having trouble extracting data from a wiki table and hope someone who has done this before can give me advice. From List_of_current_heads_of_state_and_government I need the countries (this works with the code below) and then only the first mention of the head of state together with their name. I am not sure how to isolate the first mention, as they all come in one cell. My attempt to pull the names gives me this error: IndexError: list index out of range. I would appreciate your help!

import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url,'lxml')

my_table = soup.find('table',{'class':'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []
for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]  
    states.append(state_cell.text)
print(states)
for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles)
for row in my_table.find_all('td'):
    name_cell = row.find_all('a')[1]
    names.append(name_cell.text)
print(names)

The desired output is a pandas DataFrame with these columns:

State | Title | Name

asked May 15 '18 16:05

aviss

2 Answers

I found a very short way to do this: fetch the page HTML with the wikipedia Python module, then use pandas' read_html to load its tables into DataFrames.

From there you can apply any amount of analysis you wish.

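import io
import pandas as pd
import wikipedia as wp

# .html() returns the article HTML as a str; read_html expects a path, URL,
# or file-like object, so wrap the string in StringIO
html = wp.page("List_of_video_games_considered_the_best").html()
tables = pd.read_html(io.StringIO(html))
# Prefer the 2nd table, as on most pages the first one is a contents table
df = tables[1] if len(tables) > 1 else tables[0]
print(df.to_string())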

Or, if you would like to run it from the command line, save the script below and call it like this:

python yourfile.py -p Wikipedia_Page_Article_Here

import argparse
import io
import pandas as pd
import wikipedia as wp

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--wiki_page", help="Give a wiki page to get table", required=True)
args = parser.parse_args()

# Wrap the HTML string in StringIO, since read_html expects a file-like object
html = wp.page(args.wiki_page).html()
tables = pd.read_html(io.StringIO(html))
# Prefer the 2nd table, as on most pages the first one is a contents table
df = tables[1] if len(tables) > 1 else tables[0]
print(df.to_string())
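
Picking a table by position can be brittle when the page layout changes. As a possible alternative, read_html takes a match argument that keeps only tables whose text contains a given string or regex. Here is a minimal sketch of that idea. It runs on an inline HTML snippet instead of a live page, and SAMPLE_HTML and its contents are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical two-table page standing in for a real article
SAMPLE_HTML = """
<table>
  <tr><th>Contents</th></tr>
  <tr><td>1 Overview</td></tr>
</table>
<table>
  <tr><th>State</th><th>Head of state</th></tr>
  <tr><td>Albania</td><td>President</td></tr>
  <tr><td>Andorra</td><td>Episcopal Co-Prince</td></tr>
</table>
"""

# match keeps only the tables that contain the given text,
# so the contents table is dropped regardless of its position
tables = pd.read_html(io.StringIO(SAMPLE_HTML), match="Head of state")
df = tables[0]
print(df.to_string(index=False))
```

With a real article you would pass the HTML returned by wp.page(...).html() in place of SAMPLE_HTML and choose a match string that appears only in the table you want.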

Hope this helps someone out there!

answered Sep 18 '22 15:09

rup

If I understood your question correctly, the following should get you there:

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"

res = requests.get(URL).text
soup = BeautifulSoup(res,'lxml')
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th','td'])
    try:
        country = data[0].a.text
        title = data[1].a.text
        name = data[1].a.find_next_sibling().text
    except (IndexError, AttributeError):
        # Skip rows that lack the expected cells or links
        continue
    print("{}|{}|{}".format(country,title,name))

Output:

Afghanistan|President|Ashraf Ghani
Albania|President|Ilir Meta
Algeria|President|Abdelaziz Bouteflika
Andorra|Episcopal Co-Prince|Joan Enric Vives Sicília
Angola|President|João Lourenço
Antigua and Barbuda|Queen|Elizabeth II
Argentina|President|Mauricio Macri

And so on.
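
Since the question asks for a pandas DataFrame, the same idea can be sketched with the rows collected into records instead of printed. This is only an illustration under assumptions: SAMPLE_HTML is a simplified, hypothetical imitation of the table layout (a row header cell with the country link, followed by a cell whose first two links are the title and the name), the second title/name pair for Andorra is a placeholder, and first_head_of_state is a helper name introduced here.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified rows imitating the layout assumed above
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
  <tr><th>State</th><th>Head of state</th></tr>
  <tr>
    <th scope="row"><a href="#">Albania</a></th>
    <td><a href="#">President</a> - <a href="#">Ilir Meta</a></td>
  </tr>
  <tr>
    <th scope="row"><a href="#">Andorra</a></th>
    <td><a href="#">Episcopal Co-Prince</a> - <a href="#">Joan Enric Vives Sicília</a><br>
        <a href="#">Second Title</a> - <a href="#">Second Name</a></td>
  </tr>
</table>
"""


def first_head_of_state(row):
    """Return (state, title, name) for a row, or None if the row does not fit."""
    header = row.find("th", scope="row")
    cell = row.find("td")
    if header is None or header.a is None or cell is None:
        return None  # e.g. the column-header row
    links = cell.find_all("a")
    if len(links) < 2:
        return None  # no title/name pair in this cell
    # Only the first two links are used, i.e. the first mention in the cell
    return (
        header.a.get_text(strip=True),
        links[0].get_text(strip=True),
        links[1].get_text(strip=True),
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable")
records = [rec for rec in map(first_head_of_state, table.find_all("tr")) if rec]
df = pd.DataFrame(records, columns=["State", "Title", "Name"])
print(df.to_string(index=False))
```

Checking the number of links before indexing is what avoids the IndexError from the question: cells with fewer than two links are skipped instead of being indexed blindly. The live page may be structured differently, so the selectors would need checking against its actual HTML.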

answered Sep 18 '22 15:09

SIM


Tags: python-3.x, beautifulsoup, web-scraping, wikipedia
