
BeautifulSoup webscraping find_all( ): finding exact match

I'm using Python and BeautifulSoup for web scraping.

Let's say I have the following HTML to scrape:

<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>

Using BeautifulSoup, I want to find ONLY the products with the attribute class="product" (only Product 1 and 2), not the 'special' products.

If I do the following:

result = soup.find_all('div', {'class': 'product'}) 

the result includes ALL the products (1,2,3, and 4).

What should I do to find products whose class EXACTLY matches 'product'?


The code I ran:

from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(text)
result = soup.findAll(attrs={'class': re.compile(r"^product$")})
print(result)

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>] 
user2436815 asked Mar 29 '14 04:03





1 Answer

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.

As such, a class filter on its own cannot restrict the search to elements that have only that one class.
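To see why the plain class filter matches all four elements, it can help to inspect how the attribute is stored. The following is a minimal sketch, assuming bs4 is installed and using the built-in "html.parser"; the parser choice and the variable names are illustrative assumptions, not part of the original answer:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="product">Product 1</div>'
    '<div class="product special">Product 3</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Each class attribute is parsed into a list of individual class names.
class_lists = [div["class"] for div in soup.find_all("div")]
print(class_lists)  # [['product'], ['product', 'special']]

# The filter accepts a tag if any one of its classes equals 'product',
# so both divs match.
matched = soup.find_all("div", {"class": "product"})
print(len(matched))  # 2
```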

You'll have to use a custom function here to match against the class instead:

result = soup.find_all(lambda tag: tag.name == 'div' and
                                   tag.get('class') == ['product'])

I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. it must have just the one value.
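If the same check is needed in several places, the lambda could be written as a named function instead. This is only a sketch: has_exact_class and its default arguments are hypothetical names introduced here for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup


def has_exact_class(tag, name="div", classes=("product",)):
    """Return True for a <name> tag whose class list equals `classes` exactly."""
    # tag.get('class') is None when the attribute is absent, so tags
    # without a class never match.
    return tag.name == name and tag.get("class") == list(classes)


html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# find_all calls the function once per tag and keeps the tags
# for which it returns True.
exact = soup.find_all(has_exact_class)
names = [div.get_text() for div in exact]
print(names)  # ['Product 1', 'Product 2']
```

Note that the comparison is order-sensitive: a tag with class="special product" would not equal ['product', 'special'].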

Demo:

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
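A different approach, not taken from the answer above, is a CSS attribute selector, which compares against the whole attribute value rather than the individual classes. The sketch below assumes a bs4 release in which select() is backed by the soupsieve package (4.7 or later) and again uses "html.parser" as an assumed parser:

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# [class="product"] requires the full attribute value to be exactly
# "product", unlike the class selector div.product.
exact = soup.select('div[class="product"]')
names = [div.get_text() for div in exact]
print(names)
```

Because this compares the literal attribute string, a value such as class="product " with stray whitespace may behave differently from the list comparison shown earlier.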

For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }
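The effect of this table can be checked on an attribute other than class. A small sketch, assuming "html.parser" and a made-up <a> tag: rel is listed for "a" and so is split into a list, while id is not listed and stays a single string even though it contains a space.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a rel="nofollow noopener" id="my id" href="#">link</a>',
    "html.parser",
)
link = soup.a

# 'rel' on <a> is a multi-valued attribute: parsed into a list.
rel_value = link["rel"]
print(rel_value)  # ['nofollow', 'noopener']

# 'id' is not in the table: kept as one string.
id_value = link["id"]
print(id_value)  # my id
```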
Martijn Pieters answered Oct 16 '22 06:10
