The problem
I'm trying to parse an HTML table that contains rowspans; specifically, my college schedule.
The problem I'm running into: if the previous row contains a cell with a rowspan, the next row is missing a TD, because the spanning cell now occupies the position where that TD would have been.
I have no clue how to account for this, and I'd like to be able to parse this schedule.
What I tried
Pretty much everything I can think of.
The result I get
    [
        {
            'blok_eind': 4,
            'blok_start': 3,
            'dag': 4,  # Should be 5
            'leraar': 'DOODF000',
            'lokaal': 'ALK C212',
            'vak': 'PROJ-T',
        },
    ]
As you can see, the entry with the vak key set to PROJ-T in the output snippet above has dag set to 4, while it's supposed to be 5 (a.k.a. Friday/Vrijdag).
The result I want
A Python dict() that looks like the one posted above, but with the right value for dag.
Where:
- day/dag is an int from 1~5 representing Monday~Friday
- block_start/blok_start is an int that represents when the course starts (time block, left side of table)
- block_end/blok_eind is an int that represents in what block the course ends
- classroom/lokaal is the code of the classroom the course is in
- teacher/leraar is the teacher's ID
- course/vak is the ID of the course
Basic HTML Structure for above data
    <center>
      <table>
        <tr>
          <td>
            <table>
              <tbody>
                <tr>
                  <td> <font> TEACHER-ID </font> </td>
                  <td> <font> <b> CLASSROOM ID </b> </font> </td>
                </tr>
                <tr>
                  <td> <font> COURSE ID </font> </td>
                </tr>
              </tbody>
            </table>
          </td>
        </tr>
      </table>
    </center>
The code
HTML
<CENTER><font size="3" face="Arial" color="#000000"> <BR></font> <font size="6" face="Arial" color="#0000FF"> 16AO4EIO1B </font> <font size="4" face="Arial"> IO1B </font> <BR> <TABLE border="3" rules="all" cellpadding="1" cellspacing="1"> <TR> <TD align="center"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial" color="#000000"> Maandag 29-08 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Dinsdag 30-08 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Woensdag 31-08 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Donderdag 01-09 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Vrijdag 02-09 </font> </TD> </TR> </TABLE> </TD> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>1</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 8:30 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 9:20 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> BLEEJ002 </font> </TD> 
<TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B021</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> WEBD </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>2</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 9:20 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 10:10 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> BLEEJ002 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B021B</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> WEBD </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>3</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 10:25 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 11:15 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> DOODF000 </font> </TD> <TD width="50%" 
nowrap=1><font size="2" face="Arial"> <B>ALK C212</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> PROJ-T </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>4</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 11:15 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:05 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> BLEEJ002 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B021B</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> MENT </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>5</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:05 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:55 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD 
align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>6</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:55 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 13:45 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> JONGJ003 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B008</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> BURG </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>7</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 13:45 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 14:35 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> FLUIP000 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B004</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> ICT algemeen Prakti </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> 
</TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>8</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 14:50 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 15:40 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> KOOLE000 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B008</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> NED </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>9</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 15:40 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 16:30 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR> <TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>10</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 16:30 </font> </TD> </TR> <TR> <TD 
align="center" nowrap=1><font size="2" face="Arial"> 17:20 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR> </TABLE> <TABLE cellspacing="1" cellpadding="1"> <TR> <TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial"> Periode1 29-08-2016 (35) - 04-09-2016 (35) G r u b e r & P e t t e r s S o f t w a r e </font></CENTER>
Python
    from pprint import pprint

    from bs4 import BeautifulSoup
    import requests

    r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                     "/c/c00025.htm")

    daytable = {
        1: "Maandag",
        2: "Dinsdag",
        3: "Woensdag",
        4: "Donderdag",
        5: "Vrijdag"
    }

    timetable = {
        1: ("8:30", "9:20"),
        2: ("9:20", "10:10"),
        3: ("10:25", "11:15"),
        4: ("11:15", "12:05"),
        5: ("12:05", "12:55"),
        6: ("12:55", "13:45"),
        7: ("13:45", "14:35"),
        8: ("14:50", "15:40"),
        9: ("15:40", "16:30"),
        10: ("16:30", "17:20"),
    }

    page = BeautifulSoup(r.content, "lxml")
    roster = []
    big_rows = 2
    last_row_big = False

    # There are 10 blocks, each made up out of 2 TR's, run through them
    for block_count in range(2, 22, 2):
        # There are 5 days, first column is not data we want
        for day in range(2, 7):
            dayroster = {
                "dag": 0,
                "blok_start": 0,
                "blok_eind": 0,
                "lokaal": "",
                "leraar": "",
                "vak": ""
            }

            # This selector provides the classroom
            table_bold = page.select(
                "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                    day) + ") > table > tr > td > font > b")

            # This selector provides the teacher's code and the course ID
            table = page.select(
                "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                    day) + ") > table > tr > td > font")

            # This gets the rowspan on the current row and column
            rowspan = page.select(
                "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                    day) + ")")

            try:
                if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
                    last_row_big = True
                    # Setting end of class
                    dayroster["blok_eind"] = (block_count // 2) + 1
                else:
                    last_row_big = False
                    # Setting end of class
                    dayroster["blok_eind"] = (block_count // 2)
            except IndexError:
                pass

            if table_bold:
                x = table_bold[0]
                # Classroom ID
                dayroster["lokaal"] = x.contents[0]

            if table:
                iter = 0
                for x in table:
                    content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
                    # Cell has data
                    if content != "":
                        # Set start of class
                        dayroster["blok_start"] = block_count // 2
                        # Set day of class
                        dayroster["dag"] = day - 1
                        if iter == 0:
                            # Teacher ID
                            dayroster["leraar"] = content
                        elif iter == 1:
                            # Course ID
                            dayroster["vak"] = content
                        iter += 1

            if table or table_bold:
                # Store the data
                roster.append(dayroster)

    # Remove duplicates
    seen = set()
    new_l = []
    for d in roster:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)

    pprint(new_l)
The rowspan attribute in HTML specifies the number of rows a cell should span. That is, a cell with rowspan=2 takes up the space of two rows in the table, so the row below it contains one TD fewer.
You'll have to track the rowspans on previous rows, one per column.
You could do this simply by copying the integer value of a rowspan into a dictionary, and having subsequent rows decrement the rowspan value until it drops to 1 (or we could store the integer value minus 1 and drop to 0 for ease of coding). Then you can adjust subsequent table counts based on preceding rowspans.
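The bookkeeping itself can be sketched without any HTML. In this minimal illustration each row is just a list of (label, rowspan) tuples; `assign_columns` is a hypothetical helper name, and the sketch assumes there are no colspans to account for:

```python
def assign_columns(rows):
    """Return (row_index, column_index, label) for every cell.

    Each row is a list of (label, rowspan) tuples in source order.
    Rows and columns are numbered from 1. Colspans are not handled.
    """
    remaining = {}  # column -> number of upcoming rows still covered
    placed = []
    for row_index, row in enumerate(rows, 1):
        # columns covered in this row by cells that started in earlier rows
        covered = {col for col, left in remaining.items() if left > 0}
        for col in covered:
            remaining[col] -= 1
        column = 0
        for label, rowspan in row:
            column += 1
            # skip positions that a cell from above is occupying
            while column in covered:
                column += 1
            placed.append((row_index, column, label))
            if rowspan > 1:
                # store the value minus 1: rows below that are still covered
                remaining[column] = rowspan - 1
    return placed


rows = [
    [("mon", 1), ("tue", 2), ("wed", 1)],
    [("mon2", 1), ("wed2", 1)],  # the "tue" cell from above covers column 2
    [("mon3", 1), ("tue3", 1), ("wed3", 1)],
]
for cell in assign_columns(rows):
    print(cell)
```

The second row only has two cells in the source, yet "wed2" is still assigned to column 3, which is the correction the schedule needs.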
Your table complicates this a little by using a default span of size 2, incrementing in steps of two, but that can easily be brought back to manageable numbers by dividing by 2.
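As a small sketch of that normalisation: `extra_blocks` below is a hypothetical helper name, and it assumes that a missing rowspan attribute means the default of 2 and that every rowspan is a multiple of 2, as in the schedule above.

```python
def extra_blocks(rowspan_attr, rows_per_block=2):
    """Number of additional blocks a cell covers beyond its own.

    rowspan_attr is the raw attribute value (a string) or None if absent.
    """
    raw = int(rowspan_attr) if rowspan_attr is not None else rows_per_block
    return raw // rows_per_block - 1


print(extra_blocks("2"))   # 0: a normal one-block cell
print(extra_blocks("4"))   # 1: the cell also covers the next block
print(extra_blocks(None))  # 0: no attribute, treated as the default of 2
```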
Rather than use massive CSS selectors, select just the table rows and we'll iterate over those:
    roster = []
    rowspans = {}  # track rowspanning cells

    # every second row in the table
    rows = page.select('html > body > center > table > tr')[1:21:2]
    for block, row in enumerate(rows, 1):
        # take direct child td cells, but skip the first cell
        # (find_all with recursive=False avoids a selector that starts with '>',
        # which newer BeautifulSoup releases reject)
        daycells = row.find_all('td', recursive=False)[1:]
        rowspan_offset = 0

        for daynum, daycell in enumerate(daycells, 1):
            # rowspan handling; if there is a rowspan here, adjust to find correct position
            daynum += rowspan_offset
            while rowspans.get(daynum, 0):
                rowspan_offset += 1
                rowspans[daynum] -= 1
                daynum += 1

            # now we have a correct day number for this cell, adjusted for
            # rowspanning cells.
            # update the rowspan accounting for this cell
            rowspan = (int(daycell.get('rowspan', 2)) // 2) - 1
            if rowspan:
                rowspans[daynum] = rowspan

            texts = daycell.select("table > tr > td > font")
            if texts:
                # class info found
                teacher, classroom, course = (c.get_text(strip=True) for c in texts)
                roster.append({
                    'blok_start': block,
                    'blok_eind': block + rowspan,
                    'dag': daynum,
                    'leraar': teacher,
                    'lokaal': classroom,
                    'vak': course
                })

        # days that were skipped at the end due to a rowspan
        while daynum < 5:
            daynum += 1
            if rowspans.get(daynum, 0):
                rowspans[daynum] -= 1
This produces correct output:
    [{'blok_eind': 2,
      'blok_start': 1,
      'dag': 5,
      'leraar': u'BLEEJ002',
      'lokaal': u'ALK B021',
      'vak': u'WEBD'},
     {'blok_eind': 3,
      'blok_start': 2,
      'dag': 3,
      'leraar': u'BLEEJ002',
      'lokaal': u'ALK B021B',
      'vak': u'WEBD'},
     {'blok_eind': 4,
      'blok_start': 3,
      'dag': 5,
      'leraar': u'DOODF000',
      'lokaal': u'ALK C212',
      'vak': u'PROJ-T'},
     {'blok_eind': 5,
      'blok_start': 4,
      'dag': 3,
      'leraar': u'BLEEJ002',
      'lokaal': u'ALK B021B',
      'vak': u'MENT'},
     {'blok_eind': 7,
      'blok_start': 6,
      'dag': 5,
      'leraar': u'JONGJ003',
      'lokaal': u'ALK B008',
      'vak': u'BURG'},
     {'blok_eind': 8,
      'blok_start': 7,
      'dag': 3,
      'leraar': u'FLUIP000',
      'lokaal': u'ALK B004',
      'vak': u'ICT algemeen Prakti'},
     {'blok_eind': 9,
      'blok_start': 8,
      'dag': 5,
      'leraar': u'KOOLE000',
      'lokaal': u'ALK B008',
      'vak': u'NED'}]
Moreover, this code will continue to work even if courses span more than 2 blocks, or just one block; any rowspan size is supported.
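If you want to turn such an entry back into something readable, the daytable and timetable mappings from your question are enough. The following is only a sketch; `describe` is a hypothetical helper name and the output format is an arbitrary choice:

```python
daytable = {
    1: "Maandag", 2: "Dinsdag", 3: "Woensdag", 4: "Donderdag", 5: "Vrijdag"
}
timetable = {
    1: ("8:30", "9:20"), 2: ("9:20", "10:10"), 3: ("10:25", "11:15"),
    4: ("11:15", "12:05"), 5: ("12:05", "12:55"), 6: ("12:55", "13:45"),
    7: ("13:45", "14:35"), 8: ("14:50", "15:40"), 9: ("15:40", "16:30"),
    10: ("16:30", "17:20"),
}


def describe(entry):
    """Format one roster entry using the day and block lookup tables."""
    day = daytable[entry["dag"]]
    start = timetable[entry["blok_start"]][0]  # start time of the first block
    end = timetable[entry["blok_eind"]][1]     # end time of the last block
    return "{} {}-{}: {} in {} ({})".format(
        day, start, end, entry["vak"], entry["lokaal"], entry["leraar"])


example = {
    "blok_eind": 4, "blok_start": 3, "dag": 5,
    "leraar": "DOODF000", "lokaal": "ALK C212", "vak": "PROJ-T",
}
print(describe(example))  # Vrijdag 10:25-12:05: PROJ-T in ALK C212 (DOODF000)
```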