BeautifulSoup: find Class names: AND + NOT

I have two different sets of div tags in my HTML:

<div class="ABC BCD CDE123">

<div class="ABC BCD CDE234">

<div class="ABC BCD CDE345">

and

<div class="ABC XYZ BCD">

I want to select all the tags that have both the ABC and BCD classes but not the XYZ class, using BeautifulSoup4.

I already know about this approach:

soup.find_all('div', class_=['ABC','BCD'])

This matches with OR semantics (either ABC or BCD must be present).

I also know about this approach:

def myfunction(theclass):
    return theclass is not None and len(theclass) == 5
soup.find_all('div', class_=myfunction)

This returns all divs that have a class name of length 5.

I then tried to solve my problem with this:

soup.find_all('div', class_ = lambda x: x and 'ABC' and 'BCD' in x.split() and x and 'XYZ' not in x.split())

But this was not working. So I tried to debug it with this approach:

def myfunction(theclass):
    print(theclass)
    return True
soup.find_all('div', class_=myfunction)

The problem seems to be that, for a tag like this:

<div class="ABC BCD CDE123">

Only 'ABC' is passed to myfunction, so theclass = 'ABC' rather than theclass = 'ABC BCD CDE123', which is what I expected. I guess that is also why the lambda function fails.

Any idea how I can filter the tags according to my requirement?

I want to select all the tags that have both the ABC and BCD classes but not the XYZ class, using BeautifulSoup4.

stoney asked Jul 05 '18 11:07


1 Answer

This can be done with Python sets. Collect all elements that have both the ABC and BCD classes into one set, and all elements that have the XYZ class into another. Subtracting the second set from the first leaves only the elements you want.

To require both ABC and BCD, use the select function with a CSS selector instead of find_all:

from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

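# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(data, 'html.parser')
ABC_BCD = set(soup.select('div.ABC.BCD'))  # divs with both ABC and BCD
XYZ     = set(soup.select('div.XYZ'))      # divs with XYZ
result = ABC_BCD - XYZ                     # set difference
for element in result:
    print(element)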

Output (sets are unordered, so the order may vary):

<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>
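
As a possible alternative to the set subtraction above, the same filter can probably be written as a single CSS selector. This is a minimal sketch that assumes a reasonably recent BeautifulSoup (4.7 or later), where select supports the :not() pseudo-class; the variable name matches is only illustrative.

```python
from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(data, 'html.parser')

# div elements that have both ABC and BCD, but not XYZ.
# Assumption: the installed bs4 supports :not() in select().
matches = soup.select('div.ABC.BCD:not(.XYZ)')
for element in matches:
    print(element)
```

Unlike the set version, select returns a list in document order, so the result order is stable.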

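The same code using find_all does not give the right result, because a list passed to class_ matches elements that have any of the listed classes: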

ABC_BCD = set(soup.find_all('div', class_=['ABC','BCD']))  # ABC *or* BCD
XYZ     = set(soup.find_all('div', class_=['XYZ']))
result = ABC_BCD - XYZ
for element in result:
    print(element)

Output:

<div class="ABC BCD CDE234"></div>
<div class="ABC AAC"></div>  # this is the one we don't want
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>
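
If you would rather stay with find_all, one option is to pass a function that receives the whole tag instead of a class_ filter, so that all of the tag's classes are visible at once. The sketch below is only an illustration: the helper name has_abc_bcd_not_xyz is made up, and it uses a subset test rather than the expression 'ABC' and 'BCD' in x.split() from the question, which in Python only checks for 'BCD' because the non-empty string 'ABC' is always truthy.

```python
from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(data, 'html.parser')

def has_abc_bcd_not_xyz(tag):
    # Hypothetical helper: find_all calls it with each Tag object.
    # tag.get('class', []) is the list of classes, or [] if there are none.
    classes = set(tag.get('class', []))
    return (tag.name == 'div'
            and {'ABC', 'BCD'} <= classes   # both ABC and BCD present
            and 'XYZ' not in classes)       # XYZ absent

filtered = soup.find_all(has_abc_bcd_not_xyz)
for element in filtered:
    print(element)
```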
Saurabh Pandey answered Sep 19 '22 11:09