I have two different sets of div tags in HTML:
<div class="ABC BCD CDE123">
<div class="ABC BCD CDE234">
<div class="ABC BCD CDE345">
and
<div class="ABC XYZ BCD">
I want to select all the tags that have both the ABC and BCD classes but not the XYZ class, using BeautifulSoup 4.
I already know about this approach:
soup.find_all('div', class_=['ABC','BCD'])
This searches with OR semantics (so either ABC or BCD must be present).
I also know about this approach:
def myfunction(theclass):
    return theclass is not None and len(theclass) == 5
soup.find_all('div', class_=myfunction)
This returns all divs that have a class name of length 5.
I then tried to solve my problem with this:
soup.find_all('div', class_ = lambda x: x and 'ABC' and 'BCD' in x.split() and x and 'XYZ' not in x.split())
But this did not work, so I tried to debug it with this approach:
def myfunction(theclass):
    print(theclass)
    return True
soup.find_all('div', class_=myfunction)
The problem seems to be that for a tag like this:
<div class="ABC BCD CDE123">
only 'ABC' is handed over to myfunction, so theclass = 'ABC' and not theclass = 'ABC BCD CDE123', which is what I would have expected.
I guess that is also the reason why the lambda function fails.
Any clue how I can filter the tags according to my requirement?
I want to select all the tags that have both the ABC and BCD classes but not the XYZ class, using BeautifulSoup 4.
find returns the first match found on the page. find_all scans the entire document and returns all the matches.
This can be done with Python sets. Collect all results that have both ABC and BCD into one set, collect all results that have XYZ into a second set, and subtract the second set from the first.
To require both ABC and BCD in the search, use the select function instead of find_all:
from bs4 import BeautifulSoup
data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''
soup = BeautifulSoup(data, 'html.parser')
ABC_BCD = set(soup.select('div.ABC.BCD'))
XYZ = set(soup.select('div.XYZ'))
result = ABC_BCD - XYZ
for element in result:
    print(element)
Output:
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>
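The set subtraction can probably also be folded into a single CSS selector. This is only a sketch, assuming a BeautifulSoup version whose select is backed by the soupsieve package (4.7 or later), which understands the :not() pseudo-class; the shortened sample markup and the name matches are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (illustrative, based on the data above)
data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC AAC"></div>
'''
soup = BeautifulSoup(data, 'html.parser')

# div.ABC.BCD requires both classes on the same tag;
# :not(.XYZ) drops any tag that also carries the XYZ class.
matches = soup.select('div.ABC.BCD:not(.XYZ)')
for element in matches:
    print(element)
```

Unlike the set-based version, select returns a list, so the matches should come back in document order.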
The same code using find_all:
ABC_BCD = set(soup.find_all('div', class_=['ABC','BCD']))
XYZ = set(soup.find_all('div', class_=['XYZ']))
result = ABC_BCD - XYZ
for element in result:
    print(element)
Output:
<div class="ABC BCD CDE234"></div>
<div class="ABC AAC"></div> # this is the one we don't want
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>
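If you would rather stay with find_all, one option is to pass a function that receives the whole tag instead of using class_. As observed in the question, a function given to class_ may be called with a single class name, so it cannot see the full class list. The sketch below sidesteps that by inspecting tag.get('class') directly. The helper name wanted is hypothetical and only used for illustration.

```python
from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''
soup = BeautifulSoup(data, 'html.parser')

def wanted(tag):
    # Hypothetical helper: receives the whole Tag, so the full class list is visible.
    classes = set(tag.get('class', []))
    # Must be a div, must carry both ABC and BCD, must not carry XYZ.
    return tag.name == 'div' and {'ABC', 'BCD'} <= classes and 'XYZ' not in classes

filtered = soup.find_all(wanted)
for element in filtered:
    print(element)
```

Note also that in the lambda from the question, the sub-expression 'ABC' and 'BCD' in x.split() only tests for 'BCD', because the non-empty string 'ABC' is always truthy on its own. The subset test above checks both names explicitly.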