
Using Beautiful Soup for HTML tables that lack </td> tags

I'm struggling with parsing some flaky HTML tables down to lists with Beautiful Soup. The tables in question lack a </td> tag.

Using the following code (not the real tables I'm parsing, but functionally similar):

import bs4
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
def walk_table2(text):
    "Take an HTML table and spit out a list of lists (of entries in a row)."
    soup = bs4.BeautifulSoup(text)
    return [[x for x in row.findAll('td')] for row in soup.findAll('tr')]

print walk_table2(test)

Gives me:

[[<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>], [<td>1<td>2<td>3</td></td></td>, <td>2<td>3</td></td>, <td>3</td>]]

Rather than the expected:

[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]

It seems that the lxml parser that Beautiful Soup is using decides to add the </td> tag before the next instance of </tr> rather than the next instance of <td>.

At this point, I'm wondering if there's a good option to make the parser place the closing td tags in the correct location, or if it would be easier to use a regular expression to insert them manually before tossing the string into BeautifulSoup... Any thoughts? Thanks in advance!

asked Aug 17 '12 by user1607568


3 Answers

You're seeing decisions made by Python's built-in HTML parser. If you don't like the way that parser does things, you can tell Beautiful Soup to use a different parser. The html5lib parser and the lxml parser both give the result you want:

>>> soup = bs4.BeautifulSoup(test, "lxml")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]

>>> soup = bs4.BeautifulSoup(test, "html5lib")
>>> [[x for x in row.findAll('td')] for row in soup.findAll('tr')]
[[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]
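For completeness, here's a sketch of the question's walk_table2 with the parser named explicitly, returning the cell text instead of the Tag objects. It assumes the lxml and/or html5lib packages are installed alongside bs4:

import bs4

test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"

def walk_table2(text, parser="lxml"):
    "Take an HTML table and spit out a list of lists (of cell text in a row)."
    # Naming the parser explicitly overrides Beautiful Soup's default choice.
    soup = bs4.BeautifulSoup(text, parser)
    return [[cell.get_text() for cell in row.findAll('td')]
            for row in soup.findAll('tr')]

print(walk_table2(test))               # [['1', '2', '3'], ['1', '2', '3']] (unicode strings under Python 2)
print(walk_table2(test, "html5lib"))   # same result with html5lib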
answered by Leonard Richardson


This sounds like a BeautifulSoup bug to me. I found this page detailing why there are regressions in BS 3.1 compared to 3.0.8 (including "bad end tag" errors), which suggests that, for parsing bad HTML, one option would be to jump back several versions. That said, the page says it has been superseded and now exists only for historical reference. It's unclear, however, exactly how much BS4 resolves the issues introduced in BS 3.1; at the very least, it couldn't hurt to try the older version.
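If you do go that route, note that the BS3-series package is imported under a different name than bs4. A rough, untested sketch, assuming an older BeautifulSoup 3.0.x release is installed:

from BeautifulSoup import BeautifulSoup   # BS3-era package, not bs4

test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
# BS3's own parser has built-in nesting rules for table tags, so it may
# close the open <td> cells itself; worth verifying against your real tables.
soup = BeautifulSoup(test)
print([[td for td in row.findAll('td')] for row in soup.findAll('tr')])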

answered by dimo414


A patchy fix to get you through this particular pinch:

Massage the incoming data with a regex (this is VERY brittle, and I know how stackoverflow feels about regexes and html but C'MON, just this one time...)

import re

# Close each cell by inserting </td> before every <td> that doesn't start a
# row, and before each </tr> to close the last cell in the row.
r1 = re.compile(r'(?<!<tr>)<td', re.IGNORECASE)
r2 = re.compile(r'</tr>', re.IGNORECASE)
test = "<table> <tr><td>1<td>2<td>3</tr> <tr><td>1<td>2<td>3</tr> </table>"
test = r1.sub('</td><td', test)
test = r2.sub('</td></tr>', test)
print test

Oh, and here's what test looks like afterwards:

<table> <tr><td>1</td><td>2</td><td>3</td></tr> <tr><td>1</td><td>2</td><td>3</td></tr> </table>
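As a quick sanity check (assuming bs4 is imported as in the question), feeding the massaged string back through the original list comprehension should now give the expected rows:

soup = bs4.BeautifulSoup(test)
print([[td for td in row.findAll('td')] for row in soup.findAll('tr')])
# expected: [[<td>1</td>, <td>2</td>, <td>3</td>], [<td>1</td>, <td>2</td>, <td>3</td>]]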
answered by chucksmash