I've been reading many Q&As on how to remove all HTML from a string in Python, but none was satisfying. I need a way to remove all the tags, preserve or convert the HTML entities, and work well with UTF-8 strings.
Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings. I built a simple parser with HTMLParser to get just the text, but I was losing the entities:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, name):
        self.data.append(name)

    def handle_entityref(self, ent):
        self.data.append(ent)
This gives me something like:
[u'Asia, sp', u'cialiste du voyage ', ...
losing the entity for the accented "e" in spécialiste.
Any of the many regexps posted as answers to similar questions will always miss some edge cases that were not considered.
Is there any really good module I could use?
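For what it's worth, here is a minimal sketch of the same HTMLParser idea, assuming Python 3 rather than the Python 2 import used above. In Python 3 the parser accepts `convert_charrefs=True`, in which case character and entity references are decoded before they reach `handle_data`, so no separate `handle_charref` or `handle_entityref` is needed. The names `TextExtractor` and `strip_tags` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; entities arrive already decoded."""

    def __init__(self):
        # convert_charrefs=True turns &eacute; / &#233; into the real character
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags("<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>"))
```

Note that this keeps the contents of `<script>` and `<style>` elements as text, and it is not a sanitizer: it only extracts text, it does not make untrusted HTML safe to render.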
The re.sub() method will strip all opening and closing HTML tags in the string by replacing them with empty strings.
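A rough sketch of that regex approach, for comparison. The pattern and the helper name `strip_tags_regex` are assumptions for illustration, not taken from any particular answer; `html.unescape` from the standard library is added so that entities are converted rather than left as-is.

```python
import html
import re

# Naive pattern: "<", then anything that is not ">", then ">"
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(text):
    """Remove anything that looks like a tag, then decode entities."""
    return html.unescape(TAG_RE.sub("", text))


print(strip_tags_regex("<p>sp&eacute;cialiste</p>"))

# Edge case: a ">" inside an attribute value ends the match too early,
# so part of the attribute leaks into the output.
print(strip_tags_regex('<a title="a>b">x</a>'))
```

The second call shows the kind of edge case the question mentions: the regex has no notion of quoted attribute values, so it is only reasonable for input you already trust to be simple.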
In PHP, strip_tags() strips all HTML and PHP tags from a given string (the first parameter); the optional second parameter specifies a list of HTML tags to keep.
In Java, HTML tags can be removed from a string with the replaceAll() method of the String class, which takes a regular expression. The result is the string as plain text.
bleach is excellent for this task. It does everything you need. It has an extensive test suite that checks for strange edge cases where tags could slip through. I have never had an issue with it.