How to convert Wikipedia wikitable to Python Pandas DataFrame?

Wikipedia contains a lot of interesting tabular data that you might want to sort, filter, and so on.

Here is a sample wikitable:

{| class="wikitable sortable"
|-
! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
|-
| ION || 1.8 || 0.067 || 27 ||  || 16 || poclbm;  power consumption incl. CPU
|-
| 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
|-
| 8400 GS || 2.3 || || ||  ||  || "poclbm -w 128"
|-
|}

I'm looking for a way to import such data into a pandas DataFrame.

asked Mar 30 '13 at 22:03 by working4coins




2 Answers

Here's a solution using py-wikimarkup and PyQuery to extract all tables as pandas DataFrames from a wikimarkup string, ignoring non-table content.

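import pandas as pd
import wikimarkup
from pyquery import PyQuery

def get_tables(wiki):
    # Render the wikimarkup to HTML, then wrap it for querying.
    html = PyQuery(wikimarkup.parse(wiki))
    frames = []
    for table in html('table'):
        # One list of cell texts per row; an empty cell's text may be None.
        data = [[(x.text or '').strip() for x in row]
                for row in table]
        # The first row holds the header cells.
        df = pd.DataFrame(data[1:], columns=data[0])
        frames.append(df)
    return frames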

Given the following input,

wiki = """
=Title=

Description.

{| class="wikitable sortable"
|-
! Model !! Mhash/s !! Mhash/J !! Watts !! Clock !! SP !! Comment
|-
| ION || 1.8 || 0.067 || 27 ||  || 16 || poclbm;  power consumption incl. CPU
|-
| 8200 mGPU || 1.2 || || || 1200 || 16 || 128 MB shared memory, "poclbm -w 128 -f 0"
|-
| 8400 GS || 2.3 || || || || || "poclbm -w 128"
|-
|}

{| class="wikitable sortable"
|-
! A !! B !! C
|-
| 0
| 1
| 2
|-
| 3
| 4
| 5
|}
"""

get_tables returns the following DataFrames.

       Model Mhash/s Mhash/J Watts Clock  SP                                     Comment
0        ION     1.8   0.067    27        16        poclbm;  power consumption incl. CPU
1  8200 mGPU     1.2                1200  16  128 MB shared memory, "poclbm -w 128 -f 0"
2    8400 GS     2.3                                                     "poclbm -w 128"

 

   A  B  C
0  0  1  2
1  3  4  5
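
For simple tables like the ones above, the markup could also be parsed by hand without any extra libraries. The following is only a minimal sketch: parse_wikitable is a hypothetical helper (it is not part of py-wikimarkup or pandas), and it assumes a single table that uses nothing beyond "|-" row separators, "!"/"!!" header cells and "|"/"||" data cells. Cell attributes, captions and nested tables are not handled.

```python
def parse_wikitable(markup):
    """Parse one simple wikitable into (header, rows), all cells as strings."""
    header, rows, current = [], [], None
    for line in markup.splitlines():
        line = line.strip()
        if not line or line.startswith("{|") or line.startswith("|}"):
            continue  # skip blank lines and the table open/close markers
        if line.startswith("|-"):
            # Row separator: store the row collected so far, start a new one.
            if current:
                rows.append(current)
            current = []
        elif line.startswith("!"):
            # Header cells, separated by "!!" when written on one line.
            header.extend(cell.strip() for cell in line[1:].split("!!"))
        elif line.startswith("|"):
            # Data cells, either "a || b || c" or one "| a" per line.
            if current is None:
                current = []
            current.extend(cell.strip() for cell in line[1:].split("||"))
    if current:
        rows.append(current)  # last row when no trailing "|-" is present
    return header, rows


sample = """
{| class="wikitable sortable"
|-
! A !! B !! C
|-
| 0 || 1 || 2
|-
| 3
| 4
| 5
|}
"""

header, rows = parse_wikitable(sample)
```

The result can then be passed to pandas with pd.DataFrame(rows, columns=header). Every cell comes back as a string, so numeric columns would still need a conversion such as pd.to_numeric.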
answered Oct 03 '22 at 20:10 by Garrett


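You can use pandas directly on the rendered page. read_html parses HTML rather than raw wikimarkup and returns a list of DataFrames, one per matching table. Something like this, where url is the address of the Wikipedia article:

import pandas
tables = pandas.read_html(url, attrs={"class": "wikitable"})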
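
To try this without a network connection, the same call can be pointed at an HTML string. This is a small sketch under stated assumptions: the HTML snippet is made up for illustration, and an HTML parser that read_html can use (lxml, or BeautifulSoup with html5lib) is assumed to be installed.

```python
import io

import pandas as pd

# Made-up HTML standing in for a rendered Wikipedia page.
html = """
<table class="wikitable">
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td>0</td><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td><td>5</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

# read_html returns a list with one DataFrame per matching table;
# attrs restricts the match to tables carrying the given attribute.
tables = pd.read_html(io.StringIO(html), attrs={"class": "wikitable"})
df = tables[0]
```

Because the result is a list, the table of interest has to be picked out by index, here tables[0].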

answered Oct 03 '22 at 21:10 by cavs