I'm using Python and BeautifulSoup for web scraping.
Let's say I have the following HTML code to scrape:
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>
Using BeautifulSoup, I want to find ONLY the products whose class attribute is exactly class="product" (only Product 1 and 2), not the 'special' products.
If I do the following:
result = soup.find_all('div', {'class': 'product'})
the result includes ALL the products (1, 2, 3, and 4).
What should I do to find products whose class EXACTLY matches 'product'?
The code I ran:
from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(text)
result = soup.findAll(attrs={'class': re.compile(r"^product$")})
print(result)
Output:
[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]
BeautifulSoup has limited support for CSS selectors, but it covers the most commonly used ones. Use the select() method to find multiple elements and select_one() to find a single element.
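As a rough sketch of how select() could be applied to the exact-match problem in the question: a CSS attribute selector compares the whole class attribute value, rather than one class at a time. This assumes a BeautifulSoup 4 version whose select() is backed by the Soup Sieve package; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(html, 'html.parser')

# [class="product"] compares the full attribute value, so
# class="product special" is not selected.
exact = soup.select('div[class="product"]')
names = [tag.get_text() for tag in exact]
print(names)

# By contrast, the class selector div.product matches any div that
# has "product" among its classes.
any_product = soup.select('div.product')
print(len(any_product))
```

The attribute selector is a compact alternative when a CSS selector is already being used elsewhere in the scraper.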
The find() method returns the first tag that matches the specified name or id, as a bs4 object.
find_all() returns a ResultSet, which supports index-based access to the matches and can be iterated over with a for loop. An unfiltered search usually includes unwanted matches, so attributes such as id, class, or value are used to refine it.
In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.
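A small illustration of that behaviour, assuming the built-in 'html.parser' backend (the variable names are made up for the example): the class attribute comes back as a list, and a search on either individual class finds the same element.

```python
from bs4 import BeautifulSoup

html = '<div class="product special">Product 3</div>'
soup = BeautifulSoup(html, 'html.parser')

# The class attribute is parsed into a list of its individual values.
classes = soup.div['class']
print(classes)

# Searching for either single class matches the same <div>.
by_product = soup.find_all('div', class_='product')
by_special = soup.find_all('div', class_='special')
print(len(by_product), len(by_special))
```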
As such, you cannot limit the search to just one class.
You'll have to use a custom function here to match against the class instead:
result = soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product']; i.e. it must have just the one value.
Demo:
>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
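If the same check is needed in several places, the lambda can be wrapped in a named helper. The sketch below is one possible variation, not part of BeautifulSoup: the helper name has_exact_classes is hypothetical, and it compares sets rather than lists, so the order of classes in the attribute does not matter (unlike the list comparison above).

```python
from bs4 import BeautifulSoup

def has_exact_classes(tag, name, classes):
    """Hypothetical helper: True if `tag` is a <name> element whose
    set of classes equals `classes`, regardless of order."""
    return tag.name == name and set(tag.get('class', [])) == set(classes)

html = """
<div class="product">Product 1</div>
<div class="product special">Product 2</div>
<div class="special product">Product 3</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Only the element whose sole class is "product".
only_product = [
    tag.get_text()
    for tag in soup.find_all(lambda t: has_exact_classes(t, 'div', ['product']))
]
print(only_product)

# Elements with exactly the two classes, in either order.
product_special = [
    tag.get_text()
    for tag in soup.find_all(
        lambda t: has_exact_classes(t, 'div', ['product', 'special']))
]
print(product_special)
```

Comparing sets is a design choice: use the list comparison from the answer instead if the order of the classes in the attribute is significant.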
For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'. When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" : ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
}
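To see the effect of that table, here is a short sketch (again assuming the 'html.parser' backend, with illustrative variable names) comparing an attribute that is listed for its tag with one that is not: rel on an a tag is split into a list, while id stays a single string even when it contains a space.

```python
from bs4 import BeautifulSoup

html = '<a rel="nofollow noopener" id="first second" href="#">link</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

# "rel" is listed for <a>, so its value is split on whitespace.
rel_value = link['rel']
print(rel_value)

# "id" is not in the table, so it is kept as one string.
id_value = link['id']
print(id_value)
```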