
BeautifulSoup and Searching By Class [duplicate]

Possible Duplicate:
Beautiful Soup cannot find a CSS class if the object has other classes, too

I'm using BeautifulSoup to find tables in the HTML. The problem I am currently running into is the use of spaces in the class attribute. If my HTML reads <html><table class="wikitable sortable">blah</table></html>, I can't seem to extract it with the following (where I want to be able to find tables with either wikitable or wikitable sortable as the class):

BeautifulSoup(html).findAll(attrs={'class':re.compile("wikitable( sortable)?")})

This will find the table if my HTML is just <html><table class="wikitable">blah</table></html> though. Likewise, I have tried using "wikitable sortable" in my regular expression, and that won't match either. Any ideas?

cryptic_star asked May 04 '11 22:05


1 Answer

The pattern match will also fail if wikitable appears after another CSS class, as in class="something wikitable other". If you want all tables whose class attribute contains the class wikitable, you need a pattern that accepts more possibilities:

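import re
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)

html = '''<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></table></html>'''

# Match any class attribute that contains wikitable as a whole word.
tree = BeautifulSoup(html, "html.parser")
for node in tree.find_all(attrs={'class': re.compile(r".*\bwikitable\b.*")}):
    print(node)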

Result:

<table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></blah></table>
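
One caveat with the word-boundary pattern: \b also treats a hyphen as a boundary, so a class such as wikitable-header would match as well. Here is a minimal sketch, using only the standard library, that compares the regex against a plain whitespace split of the class string. The names WORD_BOUNDARY and has_class and the sample class strings are made up for illustration; they are not part of BeautifulSoup.

```python
import re

# The word-boundary idea from the answer, applied to raw class strings.
WORD_BOUNDARY = re.compile(r"\bwikitable\b")


def has_class(class_attr, name):
    """Return True if name is one of the whitespace-separated tokens in class_attr."""
    return name in class_attr.split()


# Hypothetical class attribute values, for illustration only.
samples = [
    "wikitable",
    "wikitable sortable",
    "sortable wikitable other",
    "wikitable-header",  # a different class that merely starts with "wikitable"
    "mywikitable",
]

for value in samples:
    regex_hit = bool(WORD_BOUNDARY.search(value))
    token_hit = has_class(value, "wikitable")
    print("%-26r regex=%-5s tokens=%s" % (value, regex_hit, token_hit))
```

The two checks agree on every sample except wikitable-header, where the regex matches and the token check does not. If your pages use hyphenated class names, splitting on whitespace is probably the safer test.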

Just for the record, I don't use BeautifulSoup, and prefer to use lxml, as others have mentioned.

samplebias answered Sep 28 '22 13:09