Possible Duplicate:
Beautiful Soup cannot find a CSS class if the object has other classes, too
I'm using BeautifulSoup to find tables in the HTML. The problem I am currently running into is the use of spaces in the class attribute. If my HTML reads <html><table class="wikitable sortable">blah</table></html>, I can't seem to extract it with the following (where I want to be able to find tables with either wikitable or wikitable sortable as the class):
BeautifulSoup(html).findAll(attrs={'class':re.compile("wikitable( sortable)?")})
This will find the table if my HTML is just <html><table class="wikitable">blah</table></html>, though. Likewise, I have tried using "wikitable sortable" in my regular expression, and that won't match either. Any ideas?
The pattern match will also fail if wikitable appears after another CSS class, as in class="something wikitable other", so if you want all tables whose class attribute contains the class wikitable, you need a pattern that accepts more possibilities:
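import re
# Assumed: BeautifulSoup 3 import; with bs4, use `from bs4 import BeautifulSoup`
from BeautifulSoup import BeautifulSoup

html = '''<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></table></html>'''
tree = BeautifulSoup(html)
# Accept "wikitable" as a whole word anywhere in the class attribute
for node in tree.findAll(attrs={'class': re.compile(r".*\bwikitable\b.*")}):
    print(node)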
Result:
<table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></blah></table>
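One caveat about the pattern above: \b treats a hyphen as a word boundary, so the pattern would also accept a class such as wikitable-collapsed (a made-up class name, used here only for illustration). A minimal sketch of a stricter test, using only the re module and a hypothetical helper called has_class that compares whole whitespace-separated tokens instead:

```python
import re

# Pattern from the answer above: word boundaries around "wikitable".
LOOSE = re.compile(r".*\bwikitable\b.*")


def has_class(class_attr, name):
    """Hypothetical helper: True if `name` is one of the
    whitespace-separated tokens in the raw class attribute string."""
    return name in class_attr.split()


samples = [
    "wikitable",
    "wikitable sortable",
    "sortable wikitable other",
    "wikitable-collapsed",  # made-up class; a different class from "wikitable"
    "mywikitable",
]

# Print each sample with the verdict of the regex and of the token test.
for value in samples:
    print(value, bool(LOOSE.match(value)), has_class(value, "wikitable"))
```

The two tests agree on every sample except wikitable-collapsed, which the regex accepts and the token comparison rejects. Whether that matters depends on the class names that actually occur in your pages.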
Just for the record, I don't use BeautifulSoup, and prefer to use lxml, as others have mentioned.
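If neither library is at hand, the same "class attribute contains wikitable" check can be sketched with the standard library alone. This is an illustration under assumptions, not how BeautifulSoup or lxml do it: it targets Python 3, the TableClassFinder class is a hypothetical name, and the fourth table in the sample HTML is added only to show a non-match.

```python
from html.parser import HTMLParser


class TableClassFinder(HTMLParser):
    """Hypothetical helper: collects the class lists of <table> tags
    that carry a given class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.matches.append(classes)


html = '''<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable">blah</table>
<table class="plain">blah</table></html>'''

finder = TableClassFinder("wikitable")
finder.feed(html)
print(finder.matches)
```

Splitting the attribute on whitespace sidesteps the regex question entirely, since the order of the classes and the presence of other classes no longer matter.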