
Beautiful Soup cannot find a CSS class if the object has other classes, too

If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') finds them both.

If it has <p class="class1 class2">, though, that element is not found. How do I find all elements with a certain class, regardless of whether they have other classes too?

endolith asked Aug 07 '09 03:08


4 Answers

Unfortunately, BeautifulSoup treats this as a single class with a space in it, 'class1 class2', rather than as two classes, ['class1', 'class2']. A workaround is to search for the class with a regular expression instead of a string.

This works:

import re
soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
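
To see what the word-boundary pattern does and does not match, here is a small sketch that uses only Python's re module on a few sample class attribute strings. The sample values and the stricter whitespace_delimited pattern are assumptions added for illustration, not part of the answer above. Note that \b also treats a hyphen as a boundary, so the original pattern matches a value like 'class1-wide' as well.

```python
import re

# Pattern from the answer: "class1" delimited by word boundaries.
word_boundary = re.compile(r'\bclass1\b')

# Stricter variant (an assumption, not from the answer): "class1" must be
# delimited by whitespace or by the start/end of the attribute value.
whitespace_delimited = re.compile(r'(?:^|\s)class1(?:\s|$)')

# Hypothetical class attribute values to test against.
class_values = ['class1', 'class1 class2', 'class2 class1',
                'class10', 'class1-wide']

for value in class_values:
    loose = bool(word_boundary.search(value))
    strict = bool(whitespace_delimited.search(value))
    print(f'{value!r}: word_boundary={loose}, whitespace_delimited={strict}')
```

If hyphenated class names are a concern, the stricter pattern could be passed to findAll in place of the original one.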

endolith answered Nov 16 '22 07:11


In case anybody comes across this question: Beautiful Soup 4 (the bs4 package) now supports this:

Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

In [1]: import bs4

In [2]: soup = bs4.BeautifulSoup('<div class="foo bar"></div>')

In [3]: soup(attrs={'class': 'bar'})
Out[3]: [<div class="foo bar"></div>]

Also, you don't have to type findAll anymore: calling the soup object directly, as in soup(...) above, does the same search.
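
As a script rather than an interactive session, the same idea might look like the sketch below. The sample HTML is made up for illustration, and it assumes bs4 is installed. It names the standard-library "html.parser" explicitly, which the session above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tags share the class "bar".
html = ('<div class="foo bar"></div>'
        '<p class="bar baz"></p>'
        '<span class="foo"></span>')

# Naming the parser explicitly avoids the "no parser was explicitly
# specified" warning that newer bs4 releases emit.
soup = BeautifulSoup(html, 'html.parser')

# Beautiful Soup 4 stores class as a list of values, not one string.
div_classes = soup.div['class']

# Matching on 'bar' finds every tag that has 'bar' among its classes.
matches = soup.find_all(attrs={'class': 'bar'})
matched_names = [tag.name for tag in matches]

print(div_classes)
print(matched_names)
```

Because the class attribute is split into a list, a tag matches as soon as any one of its classes equals the search string.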

Kugel answered Nov 16 '22 05:11


You should use lxml. It works with multiple class values separated by spaces ('class1 class2').

Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees and prefers lxml over BeautifulSoup.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or another platform where anything that isn't pure Python isn't allowed.

You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.
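
A minimal sketch of the lxml approach, assuming lxml is installed and using made-up markup. It uses XPath rather than a CSS selector so that it depends on lxml alone. The CSS form would be doc.cssselect('.class1'), which in recent lxml versions relies on the separately installed cssselect package.

```python
import lxml.html

# Hypothetical markup: "class10" is included to show it is not a false match.
html = '''<html><body>
<div class="class1">a</div>
<p class="class1 class2">b</p>
<p class="class10">c</p>
</body></html>'''
doc = lxml.html.fromstring(html)

# XPath 1.0 has no built-in "has class" test, so pad the normalized
# attribute value with spaces and look for " class1 " as a whole token.
has_class1 = ("//*[contains(concat(' ', normalize-space(@class), ' '), "
              "' class1 ')]")
tags = [el.tag for el in doc.xpath(has_class1)]

print(tags)
```

The space padding is what keeps class10 from matching, which a plain contains(@class, 'class1') would not do.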

aehlke answered Nov 16 '22 06:11


It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

For example:

soup.find_all("a", class_="class1")
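
Here is a slightly fuller sketch of class_ in use, with made-up markup and assuming bs4 is installed. It also shows one way to require two classes at once with a CSS selector via select(), which is not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links with different class combinations.
html = ('<a class="class1" href="#one">one</a>'
        '<a class="class1 class2" href="#two">two</a>'
        '<a class="class2" href="#three">three</a>')
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any tag that has "class1" among its classes.
with_class1 = [a['href'] for a in soup.find_all('a', class_='class1')]

# A CSS selector is one way to require both classes on the same tag.
with_both = [a['href'] for a in soup.select('a.class1.class2')]

print(with_class1)
print(with_both)
```

Passing a string with a space, such as class_="class1 class2", only matches that exact attribute value in that order, so the selector form is usually the safer choice for multiple classes.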

alan_wang answered Nov 16 '22 07:11