
BeautifulSoup: How to get nested divs

Given the following code:

<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>

How can I extract the word test from <div class="category5"> using BeautifulSoup, i.e. how do I deal with nested divs? I searched the Internet but could not find an easy-to-grasp example, so I set up this one. Thanks.

asked by torr, Oct 29 '14 09:10



1 Answer

XPath would be the most straightforward answer, but BeautifulSoup does not support it.

Updated: with a BeautifulSoup solution

Given that you know the element (div) and its class in this case, you can loop over the matching elements, filtering with attrs, to get what you want:

from bs4 import BeautifulSoup

html = '''
<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>'''

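# Name the parser explicitly; 'lxml' also works here if it is installed.
content = BeautifulSoup(html, 'html.parser')

# find_all returns every matching div, however deeply it is nested.
for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text.strip())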

test

I have no problem extracting the text from your HTML sample. As @MartijnPieters suggested, you will need to find out why your div element is missing.
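If the nesting itself matters, and not just the class of the innermost div, the path can be spelled out. The following is a minimal sketch that assumes a reasonably recent bs4 (one that provides select_one) and uses a trimmed copy of the sample markup; it is one way to walk the hierarchy, not the only one.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question.
html = '''
<div class="category1" id="foo">
  <div class="category2" id="bar">
    <div class="category3"></div>
    <div class="category4">
      <div class="category5"> test
      </div>
    </div>
  </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# CSS selector: '>' means direct child, a space means any descendant.
nested = soup.select_one('div#foo > div#bar div.category4 > div.category5')

# Chained find() calls do the same walk; each call searches the descendants
# of the element returned by the previous one.
chained = (soup.find('div', id='foo')
               .find('div', class_='category4')
               .find('div', class_='category5'))

text_from_select = nested.get_text(strip=True)
text_from_find = chained.get_text(strip=True)
print(text_from_select)
print(text_from_find)
```

Both find() and select_one() return None when nothing matches, so on real pages it is worth checking each result before chaining another call onto it.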

Another update

You are missing lxml as a parser for BeautifulSoup, which is why None was returned: nothing was parsed to begin with. Installing lxml should solve your issue.
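To check whether lxml is actually importable before relying on it, something along these lines may help. It is only a sketch: pick_parser is a hypothetical helper name, and falling back to the standard library's 'html.parser' is my assumption about what you would want when lxml is absent.

```python
import importlib.util


def pick_parser():
    """Return 'lxml' if that package is importable, else 'html.parser'.

    Hypothetical helper: both strings are parser names BeautifulSoup
    accepts as its second argument.
    """
    if importlib.util.find_spec('lxml') is not None:
        return 'lxml'
    return 'html.parser'


parser_name = pick_parser()
print(parser_name)
```

The result can then be passed on, e.g. BeautifulSoup(html, parser_name).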

You may also consider using lxml or a similar library that supports XPath directly; it is dead easy if you ask me.

from lxml import etree

tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n                 ']
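If installing lxml is not an option, the standard library's xml.etree.ElementTree understands a limited XPath subset that is enough for this case. Note that this sketch swaps in a strict XML parser, so it only works because the sample happens to be well-formed XML; real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Same sample as in the question; it parses as XML because every tag is closed.
html = '''
<html>
<body>
<div class="category1" id="foo">
      <div class="category2" id="bar">
            <div class="category3">
            </div>
            <div class="category4">
                 <div class="category5"> test
                 </div>
            </div>
      </div>
</div>
</body>
</html>'''

root = ET.fromstring(html)

# './/div' searches all descendants; the predicate filters on the class attribute.
matches = root.findall('.//div[@class="category5"]')

# .text holds the text directly inside each div, including surrounding whitespace.
texts = [div.text.strip() for div in matches]
print(texts)
```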
answered by Anzel, Nov 10 '22 08:11
