Extract outer div using BeautifulSoup

If the HTML code looks like this:

<div class="div1">
<p>hello</p>
<p>hi</p>
    <div class="nesteddiv">
        <p>one</p>
        <p>two</p>
        <p>three</p>
    </div>
</div>

How do I extract just

<div class="div1">
    <p>hello</p>
    <p>hi</p>
</div>

I already tried parser.find('div', 'div1') but I'm getting the whole div including the nested one.

John Wine asked Apr 02 '26 17:04


2 Answers

find() returns the tag with all of its descendants, so the nested div comes along with it. Call extract() on the nested div to remove it from the document, then take the outer div. Here is an example (where html is the HTML you provided in the question):

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.div.div.extract()
<div class="nesteddiv">
<p>one</p>
<p>two</p>
<p>three</p>
</div>
>>> soup.div
<div class="div1">
<p>hello</p>
<p>hi</p>

</div>
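
The session above uses the legacy BeautifulSoup 3 import. As a minimal sketch of the same idea with the current bs4 package (assuming bs4 is installed, and using the standard library's "html.parser" backend as an assumed choice of parser), it might look like this:

```python
from bs4 import BeautifulSoup

# The HTML from the question.
html = """<div class="div1">
<p>hello</p>
<p>hi</p>
    <div class="nesteddiv">
        <p>one</p>
        <p>two</p>
        <p>three</p>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# soup.div is the first div in the document (div1);
# soup.div.div is the first div inside it (nesteddiv).
# extract() detaches that tag from the tree and returns it.
nested = soup.div.div.extract()

# The outer div no longer contains the nested one.
outer = soup.div
print(outer)
```

Note that extract() modifies the parsed tree in place, so parse the HTML again if you still need the original structure afterwards.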
Johnsyweb answered Apr 04 '26 08:04


Why not just find() the nested div and then remove it from the tree using extract()?
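
A rough sketch of that approach, assuming the bs4 package and the "html.parser" backend. The helper name outer_without_nested and its parameters are hypothetical, made up for illustration:

```python
from bs4 import BeautifulSoup


def outer_without_nested(html, outer_class, nested_class):
    """Hypothetical helper: return the outer div with the nested div removed.

    Returns None if no div with outer_class is found.
    """
    soup = BeautifulSoup(html, "html.parser")
    outer = soup.find("div", class_=outer_class)
    if outer is None:
        return None
    # Look for the nested div only inside the outer one.
    nested = outer.find("div", class_=nested_class)
    if nested is not None:
        nested.extract()  # detach it from the tree
    return outer


# The HTML from the question.
sample_html = """<div class="div1">
<p>hello</p>
<p>hi</p>
    <div class="nesteddiv">
        <p>one</p>
        <p>two</p>
        <p>three</p>
    </div>
</div>"""

result = outer_without_nested(sample_html, "div1", "nesteddiv")
print(result)
```

Searching by class rather than by position keeps working if other divs are later added before the nested one.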

Alexander Tsepkov answered Apr 04 '26 06:04
