
Strip HTML from strings in Python

Tags: python, html


I have always used this function to strip HTML tags, as it requires only the Python stdlib:

For Python 3:

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
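
For example, once strip_tags is defined as above:

print(strip_tags('<p>Hello <b>world</b>!</p>'))
# Hello world!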

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

I haven't thought much about the cases it will miss, but you can do a simple regex:

re.sub('<[^<]+?>', '', text)

For those who don't understand regex, this searches for a string <...>, where the inner content is made of one or more (+) characters that aren't a <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately with the ?. Without it, it will match the entire string <..Hello..>.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as the escape sequence &lt; anyway, so the [^<] may be unnecessary.
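
A runnable version of the example above:

import re

print(re.sub('<[^<]+?>', '', '<p>Hello</p>'))
# Hello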


You can use BeautifulSoup's get_text() feature.

from bs4 import BeautifulSoup

html_str = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str)

print(soup.get_text()) 
#or via attribute of Soup Object: print(soup.text)

It is advisable to explicitly specify the parser, for example as BeautifulSoup(html_str, features="html.parser"), for the output to be reproducible.
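
For example, with the parser given explicitly and whitespace trimmed via get_text()'s optional arguments:

soup = BeautifulSoup(html_str, features="html.parser")
print(soup.get_text(separator=" ", strip=True))
# Please can you strip me? I am waiting....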


Short version!

import re, html
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)

# Clean up anything else by escaping
# (html.escape replaces cgi.escape, which was deprecated and removed in Python 3.8)
ready_for_web = html.escape(no_tags)

Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.
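
If MarkupSafe is already installed, you can also use it directly: Markup.striptags() strips tags and unescapes entities in one step. A minimal sketch:

from markupsafe import Markup

print(Markup('<em>Main</em> &raquo; <a href="#">About</a>').striptags())
# roughly: Main » About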

Why can't I just strip the tags and leave it?

It's one thing to keep people from <i>italicizing</i> things without leaving <i>s floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave things like unclosed comments (<!--) and angle brackets that aren't part of tags (blah <<<><blah) intact. The HTMLParser version can even leave complete tags in, if they're inside an unclosed comment.

What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.com/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.

Django's strip_tags, an improved (see next heading) version of the top answer to this question, gives the following warning:

Absolutely NO guarantee is provided about the resulting string being HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

Follow their advice!
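
If you use Django, a minimal sketch of following that advice (assuming a recent Django version; user_input is the untrusted string from earlier):

from django.utils.html import strip_tags, escape

text = strip_tags(user_input)  # NOT guaranteed to be HTML safe
safe = escape(text)            # escape before putting it back into an HTML context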

To strip tags with HTMLParser, you have to run it multiple times.

It's easy to circumvent the top answer to this question.

Look at this string (source and discussion):

<img<!-- --> src=x onerror=alert(1);//><!-- -->

The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with

<img src=x onerror=alert(1);//>

This problem was disclosed to the Django project in March, 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string:

# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value

Of course, none of this is an issue if you always escape the result of strip_tags().

Update 19 March, 2015: There was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

Good things to copy or use

My example code doesn't handle HTML entities - the Django and MarkupSafe packaged versions do.

My example code is pulled from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

Django's strip_tags and other html utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained, you could copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."
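
For example, a quick sketch with Bleach (argument details vary a bit between Bleach versions):

import bleach

# Allow <i>, strip every other tag instead of escaping it
print(bleach.clean('<i>hello</i> <iframe src="http://evil.com/"></iframe>', tags={'i'}, strip=True))
# roughly: <i>hello</i>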

Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.

sheepish note - The question itself is about printing to the console, but this is the top Google result for "python strip html from string", so that's why this answer is 99% about the web.


I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use because it strips entities).

import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Named entities are handled as per HTML 5.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()

A quick test:

html = '<a href="#">Demo <em>(&not; \u0394&#x03b7;&#956;&#x03CE;)</em></a>'
print(repr(html_to_text(html)))

Result:

'Demo (¬ Δημώ)'

Security note: Do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer will remove HTML and decode entities into plain text – that does not make the result safe to use in an HTML context.

Example: &lt;script&gt;alert("Hello");&lt;/script&gt; will be converted to <script>alert("Hello");</script>, which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.

The rule is simple: any time you insert a plain-text string into HTML output, always HTML-escape it (using html.escape(s)), even if you "know" that it doesn't contain HTML (e.g. because you stripped HTML content).
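
For instance:

import html

print(html.escape('<b>"quoted" & <tagged></b>'))
# &lt;b&gt;&quot;quoted&quot; &amp; &lt;tagged&gt;&lt;/b&gt;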

However, the OP asked about printing the result to the console, in which case no HTML escaping is needed. Instead you may want to strip ASCII control characters, as they can trigger unwanted behavior (especially on Unix systems):

import re
text = html_to_text(untrusted_html_input)
clean_text = re.sub(r'[\0-\x1f\x7f]+', '', text)
# Alternatively, if you want to allow newlines:
# clean_text = re.sub(r'[\0-\x09\x0b-\x1f\x7f]+', '', text)
print(clean_text)

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
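
For example:

print(remove_html_markup('<b class="x">Hello</b> <i>world</i>'))
# Hello world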

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here's the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)