Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove a tag using BeautifulSoup but keep its contents

Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?

like image 807
Jason Christa Avatar asked Nov 19 '09 19:11

Jason Christa


People also ask

How do I remove tags from BeautifulSoup?

Remove tags with extract() BeautifulSoup has a built in method called extract() that allows you to remove a tag or string from the tree. Once you've located the element you want to get rid of, let's say it's named i_tag , calling i_tag. extract() will remove the element and return it at the same time.

How do you replace a tag in BeautifulSoup?

To replace a tag in Beautful Soup, find the element then call its replace_with method passing in either a string or tag.

Is tag editable in BeautifulSoup?

The navigablestring object is used to represent the contents of a tag. To access the contents, use “. string” with tag. You can replace the string with another string but you can't edit the existing string.

What function in BeautifulSoup will remove a tag from the HTML tree and destroy it?

Tag. decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.


3 Answers

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So, you could do something like this:

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags: 
    for match in soup.findAll(tag):
        match.replaceWithChildren()
print soup

Looks like it behaves like you want it to and is fairly straightforward code (although it does make a few passes through the DOM, but this could easily be optimized.)

like image 51
slacy Avatar answered Oct 21 '22 20:10

slacy


The strategy I used is to replace a tag with its contents if they are of type NavigableString and if they aren't, then recurse into them and replace their contents with NavigableString, etc. Try this:

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

like image 27
Jesse Dhillon Avatar answered Oct 21 '22 19:10

Jesse Dhillon


Although this has already been mentoned by other people in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"
like image 43
corford Avatar answered Oct 21 '22 20:10

corford