Python HTML removal

Tags: python, string

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

580

asked Feb 28 '09 22:02

user29772

4 Answers

When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.

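# BeautifulSoup 4 (pip install beautifulsoup4); BeautifulSoup 3 used
# "from BeautifulSoup import BeautifulSoup" and findAll(text=True).
from bs4 import BeautifulSoup

html = "<a> Keep me </a>"
soup = BeautifulSoup(html, "html.parser")

# Collect every text node and join them back together.
text_parts = soup.find_all(string=True)
text = ''.join(text_parts)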

114

answered Sep 22 '22 08:09

Kenan Banks

There is also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml

# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p"))

# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)

So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.

You can find the documentation here.

answered Sep 24 '22 08:09

MrTopf

You can use a regular expression to remove all the tags:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'
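
As a rough check of where this pattern stops working, the sketch below runs the same substitution on the question's input and on a made-up string whose attribute value contains `>`. The helper name `strip_tags_regex` and the second test string are illustrative assumptions, not part of the answer above.

```python
import re

# Same pattern as above, compiled once.
# strip_tags_regex is a hypothetical helper name used only for this sketch.
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags_regex(s):
    """Remove anything that looks like <...> from s."""
    return TAG_RE.sub('', s)

simple = strip_tags_regex('blah blah <a href="blah">link</a>')
tricky = strip_tags_regex('<a title="a > b">link</a>')

print(repr(simple))  # 'blah blah link'
print(repr(tricky))  # ' b">link' (the tag was cut at the first '>')
```

So the regex is fine for simple markup like the example in the question, but it can leave attribute debris behind on less tidy input.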

answered Sep 25 '22 08:09

Luke Woodward

Regexes, BeautifulSoup, and html2text don't work if an attribute has '>' in it. See "Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?"

A solution based on an HTML/XML parser might help in such cases; for example, stripogram, suggested by @MrTopf, does work.

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
# Wrap the fragment in a single root element so it parses as XML.
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
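
Note that `etree.fromstring` expects well-formed XML and raises a parse error otherwise. For looser HTML, one option is Python 3's standard-library `html.parser`. The following is a minimal sketch under that assumption, not a hardened sanitizer; `TextExtractor` and `strip_tags` are names made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

def strip_tags(html):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags('blah blah <a href="blah">link</a> END'))
print(strip_tags('<a title="a > b">link</a>'))
```

Text inside `<script>` and `<style>` elements is also passed to `handle_data`, so a real implementation would probably want to skip those.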

answered Sep 22 '22 08:09

jfs
