How can I remove all HTML from a string in Python? For example, how can I turn:
blah blah <a href="blah">link</a>
into
blah blah link
Thanks!
Use the re.sub() method to strip the HTML tags from a string, e.g. with the non-greedy pattern '<.*?>'.
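A minimal sketch of that re.sub() call, assuming the goal is simply to drop anything that looks like a tag. The helper name strip_tags_regex is made up for illustration. It also shows why the ? in the pattern matters: the greedy form swallows everything between the first '<' and the last '>'.

```python
import re

def strip_tags_regex(text):
    # Hypothetical helper: remove every non-greedy <...> match.
    return re.sub(r'<.*?>', '', text)

s = 'blah blah <a href="blah">link</a>'
print(strip_tags_regex(s))     # blah blah link
print(re.sub(r'<.*>', '', s))  # greedy: the word 'link' is removed too
```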
Remove HTML tags from a string in Python using the lxml module: the fromstring() method takes the original string as input and returns the parsed document as an element. From that element, we can extract the text using the text_content() method, leaving the HTML tags behind. The text_content() method returns an object of lxml.etree.
When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.
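from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"
html = "<a> Keep me </a>"
soup = BeautifulSoup(html, "html.parser")
# Collect every text node, then join them; the tags are left out
text_parts = soup.find_all(string=True)
text = ''.join(text_parts)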
There is also a small library called stripogram which can be used to strip away some or all HTML tags.
You can use it like this:
from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)
So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.
You can find the documentation here.
You can use a regular expression to remove all the tags:
>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'
Regexes, BeautifulSoup, and html2text don't work if an attribute has '>' in it. See "Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?"
An HTML/XML parser-based solution might help in such cases; e.g., stripogram, suggested by @MrTopf, does work.
Here's an ElementTree-based solution:
# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree
str_ = 'blah blah <a href="blah">link</a> END'
# Wrap the fragment in a single root element so it parses as XML
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+
Output:
blah blah link END
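For a parser-based route that needs no third-party package, here is a rough sketch built on the standard library's html.parser.HTMLParser. The class name TextExtractor and the helper strip_tags are made up for this example. Unlike the ElementTree approach, it does not need the input to be well-formed XML, and it copes with a quoted '>' inside an attribute value, where the simple regex leaves debris behind.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)

tricky = 'blah <a title=">">link</a> END'
print(strip_tags(tricky))             # blah link END
print(re.sub('<[^>]*>', '', tricky))  # blah ">link END
```

The parser sees the quoted attribute value as part of the tag, while the regex stops at the first '>' it meets.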