If a row contains a cell with a rowspan attribute, how can I make the parsed rows line up with the table as it appears on the Wikipedia page?
from bs4 import BeautifulSoup
import urllib2
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # needed to prevent a 403 error from Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

try:
    table = soup.find_all('table')[6]
except IndexError:
    print 'No tables found, exiting'

try:
    first = table.find_all('tr')[0]
except IndexError:
    print 'No table row found, exiting'

try:
    allRows = table.find_all('tr')[1:-1]
except IndexError:
    print 'No table row found, exiting'

headers = [header.get_text() for header in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]

df = pd.DataFrame(data=results, columns=headers)
df
I get the table as the output, but for tables where a row contains rowspan cells, the values shift left: the spanned cell is missing from the following rows, so those rows have fewer columns than the header and the DataFrame no longer matches the table on the page.
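The root cause is visible even without BeautifulSoup: a row covered by a rowspan from the row above simply contains fewer td cells, so a naive per-row `find_all('td')` produces short rows. A minimal stdlib sketch (Python 3; the regex and the two hard-coded rows are for illustration only):

```python
import re

# Two rows from the cricket table: the first cell spans two rows, so the
# second <tr> contains one td fewer than the first.
html = """
<tr><td rowspan="2">2=</td><td>West Indies</td><td>4</td></tr>
<tr><td>India</td><td>4</td></tr>
"""

# Count the <td> cells in each <tr>: this mismatch is what shifts the columns.
counts = [len(re.findall(r'<td', tr))
          for tr in re.findall(r'<tr>(.*?)</tr>', html, re.S)]
print(counts)  # [3, 2]
```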
None of the parsers I found on Stack Overflow or across the web worked for me - they all parsed my tables from Wikipedia incorrectly. So here you go, a parser that actually works and is simple. Cheers.
Define the parser functions:
import numpy as np
import pandas as pd

def pre_process_table(table):
    """
    INPUT:
        1. table - a bs4 element that contains the desired table: ie <table> ... </table>
    OUTPUT:
        a tuple of:
        1. rows - a list of table rows ie: list of <tr>...</tr> elements
        2. num_rows - number of rows in the table
        3. num_cols - number of columns in the table
    """
    rows = [x for x in table.find_all('tr')]
    num_rows = len(rows)

    # get an initial column count; most often this is already accurate
    num_cols = max([len(x.find_all(['th', 'td'])) for x in rows])

    # sometimes the tables also contain multi-colspan headers; this accounts for that
    header_rows_set = [x.find_all(['th', 'td']) for x in rows
                       if len(x.find_all(['th', 'td'])) > num_cols / 2]
    num_cols_set = []
    for header_rows in header_rows_set:
        num_cols = 0
        for cell in header_rows:
            row_span, col_span = get_spans(cell)
            num_cols += col_span
        num_cols_set.append(num_cols)
    num_cols = max(num_cols_set)

    return (rows, num_rows, num_cols)
def get_spans(cell):
    """
    INPUT:
        1. cell - a <td>...</td> or <th>...</th> element that contains a table cell entry
    OUTPUT:
        1. a tuple with the cell's row and col spans
    """
    if cell.has_attr('rowspan'):
        rep_row = int(cell.attrs['rowspan'])
    else:
        rep_row = 1
    if cell.has_attr('colspan'):
        rep_col = int(cell.attrs['colspan'])
    else:
        rep_col = 1

    return (rep_row, rep_col)
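The span-reading logic can be sanity-checked with nothing but the standard library's html.parser (Python 3; the `CellSpanParser` name is made up for illustration): read rowspan/colspan off each cell, defaulting to 1 when the attribute is absent, just as `get_spans` does.

```python
from html.parser import HTMLParser

class CellSpanParser(HTMLParser):
    """Collect (rowspan, colspan) for every td/th cell, defaulting both to 1."""
    def __init__(self):
        super().__init__()
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            a = dict(attrs)
            self.spans.append((int(a.get('rowspan', 1)), int(a.get('colspan', 1))))

p = CellSpanParser()
p.feed('<tr><td rowspan="2">2=</td><td>West Indies</td><td colspan="3">4</td></tr>')
print(p.spans)  # [(2, 1), (1, 1), (1, 3)]
```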
def process_rows(rows, num_rows, num_cols):
    """
    INPUT:
        1. rows - a list of table rows ie <tr>...</tr> elements
    OUTPUT:
        1. data - a Pandas dataframe with the html data in it
    """
    data = pd.DataFrame(np.ones((num_rows, num_cols)) * np.nan)
    for i, row in enumerate(rows):
        try:
            col_stat = data.iloc[i, :][data.iloc[i, :].isnull()].index[0]
        except IndexError:
            print(i, row)

        for j, cell in enumerate(row.find_all(['td', 'th'])):
            rep_row, rep_col = get_spans(cell)

            # find the first not-yet-filled column and start filling there
            while any(data.iloc[i, col_stat:col_stat + rep_col].notnull()):
                col_stat += 1

            data.iloc[i:i + rep_row, col_stat:col_stat + rep_col] = cell.getText()
            if col_stat < data.shape[1] - 1:
                col_stat += rep_col

    return data
def main(table):
    rows, num_rows, num_cols = pre_process_table(table)
    df = process_rows(rows, num_rows, num_cols)
    return df
Here's an example of how one would use the above code on this Wisconsin data. Suppose the page is already in a bs4 soup; then:
## Find tables on the page and locate the desired one:
tables = soup.findAll("table", class_='wikitable')
## I want table 3 or the one that contains years 2000-2018
table = tables[3]
## run the above functions to extract the data
rows, num_rows, num_cols = pre_process_table(table)
df = process_rows(rows, num_rows, num_cols)
My parser above will accurately parse tables such as the ones here, while all others fail to recreate the tables at numerous points.
There may be a simpler solution to the above issue if it's a pretty well-formatted table with rowspan attributes. Pandas has a fairly robust read_html function that can parse the provided html tables and seems to handle rowspan fairly well (it couldn't parse the Wisconsin data, though). fillna(method='ffill') can then populate the unpopulated rows. Note that this does not necessarily work across column spans, and some cleanup will be necessary afterwards.
Consider the html code:
s = """<table width="100%" border="1">
<tr>
<td rowspan="1">one</td>
<td rowspan="2">two</td>
<td rowspan="3">three</td>
</tr>
<tr><td>"4"</td></tr>
<tr>
<td>"55"</td>
<td>"99"</td>
</tr>
</table>
"""
In order to process it into the requested output, just do:
In [16]: df = pd.read_html(s)[0]
In [29]: df
Out[29]:
0 1 2
0 one two three
1 "4" NaN NaN
2 "55" "99" NaN
Then to fill the NAs,
In [30]: df.fillna(method='ffill')
Out[30]:
0 1 2
0 one two three
1 "4" two three
2 "55" "99" three
pandas >= 0.24.0 understands colspan and rowspan attributes, as documented in the release notes. To extract the wikipage table that was giving you issues previously, the following works.
import pandas as pd
# Extract all tables from the wikipage
dfs = pd.read_html("http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records")
# The table referenced above is the 7th on the wikipage
df = dfs[6]
# The last row is just the date of the last update
df = df.iloc[:-1]
Out:
Rank Victories Opposition Most recent venue Date
0 1 6 South Africa Lord's, London, England 21 June 1951
1 =2 4 India Wankhede Stadium, Mumbai, India 23 November 2012
2 =2 4 West Indies Lord's, London, England 6 May 2009
3 4 3 Australia Sydney Cricket Ground, Sydney, Australia 2 December 1932
4 5 2 Pakistan Trent Bridge, Nottingham, England 10 August 1967
5 6 1 Sri Lanka Old Trafford Cricket Ground, Manchester, England 13 June 2002
The problem is due to the following case, as you know. Given this html content:
<tr>
<td rowspan="2">2=</td>
<td>West Indies</td>
<td>4</td>
<td>Lord's</td>
<td>2009</td>
</tr>
<tr>
<td style="text-align:left;">India</td>
<td>4</td>
<td>Mumbai</td>
<td>2012</td>
</tr>
So when a td has a rowspan attribute, the same td value should be treated as repeated for the following tr elements at the same position, and the value of rowspan gives the number of tr tags it applies to.

Capture the rowspan information and save it in a variable: the sequence number of the tr tag, the sequence number of the td tag within it, the rowspan value (i.e. how many tr tags share the same td), and the text value of the td. Then insert the saved values into the following tr rows according to this method.

Note: only the given test case has been checked; more test cases still need checking.
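The expansion just described - repeat the spanning cell's text into the next rowspan-1 rows at the same column index - can be sketched with plain Python lists (Python 3; the sample rows are hard-coded from the html snippet above):

```python
# Each cell is (text, rowspan), transcribed from the <tr> snippet above.
rows = [
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]

# Start from the raw text of each row, then re-insert every spanning
# cell's text into the following rowspan-1 rows at the same position.
expanded = [[text for text, _ in row] for row in rows]
for i, row in enumerate(rows):
    for j, (text, span) in enumerate(row):
        for k in range(1, span):
            expanded[i + k].insert(j, text)

print(expanded[1])  # ['2=', 'India', '4', 'Mumbai', '2012']
```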
code:
from bs4 import BeautifulSoup
import urllib2
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # needed to prevent a 403 error from Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

table = soup.find_all('table')[6]
tmp = table.find_all('tr')
first = tmp[0]
allRows = tmp[1:-1]

headers = [header.get_text() for header in first.find_all('th')]
results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

# list of tuples (level of tr, level of td, total count, text value), e.g.
# <td rowspan="2">2=</td> becomes (1, 0, 2, u'2='):
# (<tr> sequence is 1, td sequence in the tr is 0, repeated 2 times, value is 2=)
rowspan = []
for no, tr in enumerate(allRows):
    for td_no, data in enumerate(tr.find_all('td')):
        if data.has_attr("rowspan"):
            rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))

if rowspan:
    for i in rowspan:
        # insert the saved value into the next rowspan-1 rows
        for j in xrange(1, i[2]):
            results[i[0] + j].insert(i[1], i[3])

df = pd.DataFrame(data=results, columns=headers)
print df
output:
Rank Opponent No. wins Most recent venue Season
0 1 South Africa 6 Lord's 1951
1 2= West Indies 4 Lord's 2009
2 2= India 4 Mumbai 2012
3 4 Australia 3 Sydney 1932
4 5 Pakistan 2 Trent Bridge 1967
5 6 Sri Lanka 1 Old Trafford 2002
It works for table 10 as well:
Rank Hundreds Player Matches Innings Average
0 1 25 Alastair Cook 107 191 45.61
1 2 23 Kevin Pietersen 104 181 47.28
2 3 22 Colin Cowdrey 114 188 44.07
3 3 22 Wally Hammond 85 140 58.46
4 3 22 Geoffrey Boycott 108 193 47.72
5 6 21 Andrew Strauss 100 178 40.91
6 6 21 Ian Bell 103 178 45.30
7 8= 20 Ken Barrington 82 131 58.67
8 8= 20 Graham Gooch 118 215 42.58
9 10 19 Len Hutton 79 138 56.67