BeautifulSoup webscraping find_all( ): finding exact match

I'm using Python and BeautifulSoup for web scraping.

Let's say I have the following HTML code to scrape:

    <body>
        <div class="product">Product 1</div>
        <div class="product">Product 2</div>
        <div class="product special">Product 3</div>
        <div class="product special">Product 4</div>
    </body>

Using BeautifulSoup, I want to find ONLY the products whose attribute is exactly class="product" (only Product 1 and 2), not the 'special' products.

If I do the following:

result = soup.find_all('div', {'class': 'product'})

the result includes ALL the products (1,2,3, and 4).

What should I do to find products whose class EXACTLY matches 'product'??

The code I ran:

    from bs4 import BeautifulSoup
    import re

    text = """
    <body>
        <div class="product">Product 1</div>
        <div class="product">Product 2</div>
        <div class="product special">Product 3</div>
        <div class="product special">Product 4</div>
    </body>"""

    soup = BeautifulSoup(text)
    # The regex is tested against each class value separately,
    # so "product special" still matches on its "product" value.
    result = soup.findAll(attrs={'class': re.compile(r"^product$")})
    print(result)

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]

How do I find a specific element with BeautifulSoup?

BeautifulSoup has limited support for CSS selectors, but it covers the most commonly used ones. Use the select() method to find all matching elements and select_one() to find a single element.
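As a minimal sketch of the select() and select_one() calls mentioned above, the snippet below reuses the markup from the question. The attribute selector div[class="product"] is an illustration chosen here, not something the original text proposes; it compares the whole class attribute value, so it only matches when the attribute is written exactly as "product". The "html.parser" argument is likewise an assumption, picked so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# The class selector matches every div that has "product" among its classes.
all_products = soup.select("div.product")

# The attribute selector compares the full attribute value instead.
plain_products = soup.select('div[class="product"]')

# select_one() returns only the first match, or None if there is none.
first_special = soup.select_one("div.special")

print(len(all_products))
print([tag.get_text() for tag in plain_products])
print(first_special.get_text())
```

Note that the attribute selector is sensitive to how the attribute is written: class="special product" would not match div[class="product special"].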

What is the find() method in BeautifulSoup?

The find() method returns the first tag that matches the specified name or attributes (such as id) as a bs4.element.Tag object, or None if nothing matches.

What does soup.find_all() return?

find_all() returns a ResultSet, a list-like object that offers index-based access to the matches and can be iterated over with a for loop. A broad search often returns unwanted tags, so attributes like id, class, or value are used to further refine the search.
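The difference in return types can be checked with a short sketch like the one below. The markup is the one from the question, and the variable names and the "html.parser" choice are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
first_special = soup.find("div", class_="special")
missing = soup.find("span")

# find_all() returns a ResultSet, which behaves like a list of Tags.
specials = soup.find_all("div", class_="special")

print(type(first_special).__name__, missing)
print(type(specials).__name__, len(specials))
for tag in specials:
    print(tag.get_text())
```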

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.

As such, you cannot limit the search to just one class.
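A small sketch of what this means in practice, assuming the default parsing behaviour and the "html.parser" backend: the class attribute comes back as a list of values, and a search for either value finds the tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="product special">Product 3</div>', "html.parser"
)
div = soup.div

# The attribute value is split on whitespace into a list.
classes = div["class"]
print(classes)

# Searching for any single value in that list matches the tag.
by_product = soup.find_all("div", class_="product")
by_special = soup.find_all("div", class_="special")
print(len(by_product), len(by_special))
```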

You'll have to use a custom function here to match against the class instead:

    result = soup.find_all(lambda tag: tag.name == 'div' and
                                       tag.get('class') == ['product'])

I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. it has just the one value.

Demo:

    >>> from bs4 import BeautifulSoup
    >>> text = """
    ... <body>
    ...     <div class="product">Product 1</div>
    ...     <div class="product">Product 2</div>
    ...     <div class="product special">Product 3</div>
    ...     <div class="product special">Product 4</div>
    ... </body>"""
    >>> soup = BeautifulSoup(text)
    >>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
    [<div class="product">Product 1</div>, <div class="product">Product 2</div>]
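If the same kind of filter is needed for several class combinations, the lambda could be wrapped in a small factory. The sketch below is one possible variation, not part of the original answer: has_exact_classes is a hypothetical helper name, and it compares the classes as a set, so the order in which they appear in the markup does not matter. It assumes the default parsing, where class is a list.

```python
from bs4 import BeautifulSoup


def has_exact_classes(name, *classes):
    """Build a find_all() filter (hypothetical helper).

    The returned function accepts a tag when its name equals `name`
    and its set of classes equals `classes`, regardless of order.
    """
    wanted = set(classes)

    def match(tag):
        # Tags without a class attribute are treated as having no classes.
        return tag.name == name and set(tag.get("class", [])) == wanted

    return match


html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

plain = soup.find_all(has_exact_classes("div", "product"))
special = soup.find_all(has_exact_classes("div", "special", "product"))

print([tag.get_text() for tag in plain])
print([tag.get_text() for tag in special])
```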

For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:

    # The HTML standard defines these attributes as containing a
    # space-separated list of values, not a single value. That is,
    # class="foo bar" means that the 'class' attribute has two values,
    # 'foo' and 'bar', not the single value 'foo bar'.  When we
    # encounter one of these attributes, we will parse its value into
    # a list of values if possible. Upon output, the list will be
    # converted back into a string.
    cdata_list_attributes = {
        "*" : ['class', 'accesskey', 'dropzone'],
        "a" : ['rel', 'rev'],
        "link" :  ['rel', 'rev'],
        "td" : ["headers"],
        "th" : ["headers"],
        "td" : ["headers"],
        "form" : ["accept-charset"],
        "object" : ["archive"],

        # These are HTML5 specific, as are *.accesskey and *.dropzone above.
        "area" : ["rel"],
        "icon" : ["sizes"],
        "iframe" : ["sandbox"],
        "output" : ["for"],
        }
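Newer BeautifulSoup releases let you switch this list parsing off. The sketch below assumes a version that accepts the multi_valued_attributes constructor argument (to my knowledge 4.8.0 or later), which is not something the answer above relies on. With it set to None, class stays a plain string, so an ordinary class_ search compares the whole value.

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

# Assumption: this bs4 version supports multi_valued_attributes.
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

# The attribute is no longer split into a list of values.
third_class = soup.find_all("div")[2]["class"]
print(repr(third_class))

# A plain string search now has to match the entire attribute value.
exact = soup.find_all("div", class_="product")
print([tag.get_text() for tag in exact])
```

The trade-off is that searching by a single class no longer finds tags that carry additional classes, so this only suits documents where exact matching is wanted throughout.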

Tags: python, html, regex, beautifulsoup, web-scraping

user2436815

1 Answer

Martijn Pieters
