
How to select a class of div inside of a div with beautiful soup?

I have a bunch of div tags within div tags:

<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
</div>
<div class="bar">Don't want this either
</div>

I'm using Python and Beautiful Soup to separate things out. I need every div with class "bar", but only when it is nested inside a div with class "foo". Here's my code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
tag = soup.div
for each_div in soup.findAll('div',{'class':'foo'}):
    print(tag["bar"]).encode("utf-8")

Alternately, I tried:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
for each_div in soup.findAll('div',{'class':'foo'}):
     print(each_div.findAll('div',{'class':'bar'})).encode("utf-8")

What am I doing wrong? I would be just as happy with just a simple print(each_div) if I could remove the div class "unwanted" from the selection.

asked Mar 06 '14 07:03 by parap



1 Answer

You can use find_all() to find every <div> element whose class is foo and, for each of them, use find() to get the <div> whose class is bar inside it, like:

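from bs4 import BeautifulSoup
import sys

# Parse the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for foo in soup.find_all('div', attrs={'class': 'foo'}):
    # find() returns only the first matching descendant, or None.
    bar = foo.find('div', attrs={'class': 'bar'})
    if bar is not None:
        print(bar.text)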

Run it like:

python3 script.py htmlfile

That yields:

I want this
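
The question also mentions that a plain print(each_div) would be fine if the "unwanted" div could be dropped first. Here is a minimal sketch of that idea, assuming the markup from the question is available as a string; the HTML constant and the foo_without_unwanted helper are names made up for this example. It uses decompose(), which removes a tag and its contents from the parsed tree.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (inlined here for illustration).
HTML = """
<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
</div>
<div class="bar">Don't want this either
</div>
"""


def foo_without_unwanted(html):
    """Return each foo <div> as a string after dropping its 'unwanted' divs."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for each_div in soup.find_all("div", class_="foo"):
        # decompose() removes the tag and everything inside it from the tree.
        for unwanted in each_div.find_all("div", class_="unwanted"):
            unwanted.decompose()
        results.append(str(each_div))
    return results


if __name__ == "__main__":
    for markup in foo_without_unwanted(HTML):
        print(markup)
```

Note that this modifies the parsed tree, so it fits best when the "unwanted" divs are not needed afterwards.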

UPDATE: If a foo <div> can contain several <div> elements with class bar, the find()-based script above won't be enough, because find() returns only the first match. Instead, you can iterate over the descendants of each foo <div>, like:

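from bs4 import BeautifulSoup
import sys

# Parse the HTML file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for foo in soup.find_all('div', attrs={'class': 'foo'}):
    # Walk every node below this foo <div>, at any depth.
    for d in foo.descendants:
        # Text nodes fail the name test; membership also matches class="bar other".
        if d.name == 'div' and 'bar' in d.get('class', []):
            print(d.text)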

With an input like:

<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
     <div class="bar">Also want this</div>
</div>

It will yield:

I want this
Also want this
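
As a possible alternative to walking the descendants by hand, the same nesting rule can be written as a CSS selector and passed to select(). The sketch below assumes the sample input above plus the stray outer bar <div> from the question, inlined as a string; HTML and bar_texts_inside_foo are names invented for this example.

```python
from bs4 import BeautifulSoup

# Sample input from above, plus the outer "bar" div from the question.
HTML = """
<div class="foo">
     <div class="bar">I want this</div>
     <div class="unwanted">Not this</div>
     <div class="bar">Also want this</div>
</div>
<div class="bar">Don't want this either
</div>
"""


def bar_texts_inside_foo(html):
    """Return the text of every bar <div> nested inside a foo <div>."""
    soup = BeautifulSoup(html, "html.parser")
    # "div.foo div.bar" matches bar divs at any depth below a foo div.
    return [bar.get_text(strip=True) for bar in soup.select("div.foo div.bar")]


if __name__ == "__main__":
    for text in bar_texts_inside_foo(HTML):
        print(text)
```

If only direct children should match, the selector "div.foo > div.bar" narrows it accordingly.
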
answered Sep 17 '22 15:09 by Birei