Safely remove all HTML code from a string in Python

I've read many Q&As on how to remove all the HTML code from a string using Python, but none was satisfying. I need a way to remove all the tags, preserve or convert the HTML entities, and work well with UTF-8 strings.

Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings, so I built a simple parser with HTMLParser to get just the text, but I was losing the entities:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, name):
        self.data.append(name)

    def handle_entityref(self, ent):
        self.data.append(ent)

This gives me something like:

[u'Asia, sp', u'cialiste du voyage ', ...

which loses the entity for the accented "e" in "spécialiste".

The many regexps offered as answers to similar questions always miss some edge cases.

Is there any really good module I could use?

asked Apr 09 '13 00:04

Arjuna Del Toso

1 Answer

bleach is excellent for this task. It does everything you need. It has an extensive test suite that checks for strange edge cases where tags could slip through. I have never had an issue with it.
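
If adding a dependency is not an option, the parser from the question can probably be adapted with the standard library alone. The sketch below is not bleach's API. It assumes Python 3.4 or later, where `html.parser.HTMLParser` accepts `convert_charrefs=True` and decodes entities before passing text to `handle_data`. The names `TextExtractor` and `strip_tags` are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content and drop all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &eacute; or &#233; and deliver them to handle_data as characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags, entities already decoded.
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>"))
```

Because entities are decoded rather than dropped, the accented "e" survives. This sketch still reports the contents of `<script>` and `<style>` elements as text, and its output is plain text, so it would need escaping again before being placed back into an HTML page.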

answered Sep 25 '22 05:09

Tim Heap

Tags: python, html, security, parsing, utf-8
