Using Beautiful Soup to strip html tags from a string

Tags: python, beautifulsoup

Does anyone have some sample code that illustrates how to use Python's Beautiful Soup to strip all html tags, except some, from a string of text?

I want to strip all JavaScript and all HTML tags, keeping only:

<a></a>
<b></b>
<i></i>

And also things like:

<a onclick=""></a>

Thanks for helping -- I couldn't find much on the internet for this purpose.

asked Dec 12 '10 20:12

ensnare

1 Answer

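# Note: this uses the Beautiful Soup 3 module layout (import BeautifulSoup).
# With Beautiful Soup 4 the equivalent import is: from bs4 import BeautifulSoup, Tag
import BeautifulSoup

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

# Walk every node in the tree and print only the tags we want to keep.
for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)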

yields

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>

If you just want the text contents, you could change print(tag) to print(tag.string).
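One thing to watch with tag.string: it only gives you text when the tag has a single child. The sketch below assumes Beautiful Soup 4 (the bs4 package) rather than the older module used above, and the sample markup is made up for illustration. It shows tag.string returning None for a tag with nested children, where get_text() still returns the combined text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
soup = BeautifulSoup(
    '<b>two</b><a onclick="">see <i>this</i> link</a>', "html.parser"
)

simple = soup.b   # contains a single string
nested = soup.a   # contains a string, a tag, and another string

simple_string = simple.string    # the single string child
nested_string = nested.string    # None, because there are several children
nested_text = nested.get_text()  # all descendant strings joined together

print(repr(simple_string), repr(nested_string), repr(nested_text))
```

So if the tags you keep may contain other tags, get_text() is probably the safer choice.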

If you want to remove an attribute like onclick="" from the a tag, you could do this:

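for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        if tag.name == 'a':
            # Drop the onclick attribute from links before printing.
            del tag['onclick']
        print(tag)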
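Note that the loop above only prints the matching tags; it does not give you back the original string with everything else stripped. Here is a minimal sketch of that, assuming Beautiful Soup 4 (the bs4 package) and its built-in "html.parser" backend. The names strip_tags, ALLOWED_TAGS and DROP_WITH_CONTENT are made up for this example, and removing script and style together with their contents is my assumption about what "strip all javascript" means.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "b", "i"}            # tags to keep, as listed in the question
DROP_WITH_CONTENT = ["script", "style"]   # assumption: remove these and their contents


def strip_tags(html, allowed=ALLOWED_TAGS):
    """Return html with every tag removed except those named in `allowed`."""
    soup = BeautifulSoup(html, "html.parser")

    # First pass: delete script/style elements entirely, including their text.
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()

    # Second pass: find_all(True) returns a list of every remaining tag,
    # so it is safe to modify the tree while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            # unwrap() removes the tag itself but keeps its children in place.
            tag.unwrap()

    return str(soup)


sample = (
    '<p id="firstpara">This is <i>paragraph</i> <a onclick="">one</a>.'
    "<script>alert(1)</script></p>"
)
result = strip_tags(sample)
print(result)
```

Attributes on the kept tags (such as onclick) are left untouched here; to drop them you could combine this with the del tag['onclick'] approach shown above.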

answered Oct 20 '22 11:10

unutbu
