Currently I have code that does something like this: <pre class="prettyprint"><code>soup = BeautifulSoup(value) for tag in soup.findAll(True): if tag.name not in VALID_TAGS: tag.extract() soup.renderContents() </code></pre> Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So, you could do something like this: <pre class="prettyprint"><code>html = "Good, bad, and ugly" invalid_tags = ['b', 'i', 'u'] soup = BeautifulSoup(html) for tag in invalid_tags: for match in soup.findAll(tag): match.replaceWithChildren() print soup </code></pre> Looks like it behaves like you want it to and is fairly straightforward code (although it does make a few passes through the DOM, but this could easily be optimized.)

Although this has already been mentoned by other people in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this. <pre class="prettyprint"><code>import bleach html = "Bad Ugly <script>Evil()</script>" clean = bleach.clean(html, tags=[], strip=True) print clean # Should print: "Bad Ugly Evil()" </code></pre>

Remove a tag using BeautifulSoup but keep its contents

Tags:

python

beautifulsoup

Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?

807

asked Nov 19 '09 19:11

Jason Christa

3 Answers

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So, you could do something like this:

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags: 
    for match in soup.findAll(tag):
        match.replaceWithChildren()
print soup

Looks like it behaves like you want it to and is fairly straightforward code (although it does make a few passes through the DOM, but this could easily be optimized.)

answered Oct 21 '22 20:10

slacy

The strategy I used is to replace a tag with its contents if they are of type NavigableString and if they aren't, then recurse into them and replace their contents with NavigableString, etc. Try this:

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

answered Oct 21 '22 19:10

Jesse Dhillon

Although this has already been mentoned by other people in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

answered Oct 21 '22 20:10

corford

Related questions
                            
                                How to adjust the quality of a resized image in Python Imaging Library?
                            
                                Upgrade python without breaking yum
                            
                                Check if a string is hexadecimal
                            
                                set difference for pandas
                            
                                pandas applying regex to replace values
                            
                                Find "home directory" in Python? [duplicate]
                            
                                Suppress "None" output as string in Jinja2
                            
                                Is it possible to add PyQt4/PySide packages on a Virtualenv sandbox?
                            
                                How can I insert data into a MySQL database?
                            
                                How to get column names from SQLAlchemy result (declarative syntax)
                            
                                Use index in pandas to plot data
                            
                                What's the best way to handle Django's objects.get?
                            
                                How can I access the current executing module or class name in Python?
                            
                                Python 'self' keyword
                            
                                Parse HTML table to Python list?
                            
                                How do I download NLTK data?
                            
                                CSS Problems with Flask Web App
                            
                                NameError: name 'request' is not defined
                            
                                Reliably detect Windows in Python
                            
                                error: could not create '/usr/local/lib/python2.7/dist-packages/virtualenv_support': Permission denied

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With