
Python HTML removal

Tags: python, string

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

user29772 asked Feb 28 '09 22:02



4 Answers

When your regular expression solution hits a wall, try this short and more reliable BeautifulSoup program.

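from bs4 import BeautifulSoup  # BeautifulSoup 4; the old "BeautifulSoup" package targets Python 2

html = "<a> Keep me </a>"
soup = BeautifulSoup(html, "html.parser")

# Collect every text node and join them, which drops all tags
text_parts = soup.find_all(string=True)
text = ''.join(text_parts)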

Kenan Banks answered Sep 22 '22 08:09


There is also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)

So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.

You can find the documentation here.

MrTopf answered Sep 24 '22 08:09


You can use a regular expression to remove all the tags:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'
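
Note that this substitution only removes tags; character references such as &amp; stay in the result. A minimal sketch of one way to handle both, assuming Python 3 (the function name strip_tags and the sample string are made up for illustration):

```python
import html
import re

# Same pattern as above: anything from '<' up to the next '>'
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Remove tag-like spans, then decode character references."""
    return html.unescape(TAG_RE.sub('', text))

print(strip_tags('blah blah <a href="blah">link</a> &amp; more'))
# prints: blah blah link & more
```

The tags are removed first and the references decoded afterwards, so a literal &lt; in the text is not mistaken for the start of a tag.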

Luke Woodward answered Sep 25 '22 08:09


Regexes, BeautifulSoup, and html2text don't work if an attribute has '>' in it. See the question “Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?”

An HTML/XML parser-based solution might help in such cases; for example, stripogram, suggested by @MrTopf, does work.

Here's an ElementTree-based solution:
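
To make the regular-expression case concrete, here is a minimal sketch of the failure (the function name strip_tags_regex and the sample string are illustrative, not from any library):

```python
import re

def strip_tags_regex(text):
    """Naive tag removal: drop everything from '<' to the next '>'."""
    return re.sub(r'<[^>]*>', '', text)

# The '>' inside the quoted attribute value ends the match too early,
# so the tail of the attribute leaks into the result.
print(repr(strip_tags_regex('<a title=">">link</a>')))
# prints: '">link'
```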

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
# Wrap the fragment in a root element so it parses as well-formed XML
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
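
An XML parser raises an error on markup that is not well-formed, so for messier input a tolerant HTML parser may be a better fit. A minimal sketch using the standard library's html.parser, assuming Python 3 (the names TextExtractor and strip_html are hypothetical):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.parts.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return ''.join(parser.parts)

print(strip_html('blah blah <a href="blah">link</a> END'))
# prints: blah blah link END
print(strip_html('<a title=">">link</a>'))
# prints: link
```

The second call shows that a quoted '>' inside an attribute value is treated as part of the attribute rather than as the end of the tag.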

jfs answered Sep 22 '22 08:09