
Safely remove all HTML code from a string in Python

I've been reading many Q&As on how to remove all the HTML code from a string using Python, but none was satisfying. I need a way to remove all the tags, preserve or convert the HTML entities, and work well with UTF-8 strings.

Apparently BeautifulSoup is vulnerable to some specially crafted HTML strings. I built a simple parser with HTMLParser to get just the text, but I was losing the entities:

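# Python 2 module; in Python 3 the same class lives in html.parser
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []  # collected text fragments

    def handle_data(self, data):
        # plain text found between tags
        self.data.append(data)

    def handle_charref(self, name):
        # name is the raw numeric reference (e.g. '233'), not the decoded character
        self.data.append(name)

    def handle_entityref(self, ent):
        # ent is the raw entity name (e.g. 'eacute'), not the decoded character
        self.data.append(ent)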

This gives me something like:

[u'Asia, sp', u'cialiste du voyage ', ...

The entity for the accented "e" in "spécialiste" is lost.
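
For comparison, here is a minimal sketch of the same idea, assuming Python 3 (where the module is html.parser) rather than the Python 2 import used above. The names TextExtractor and strip_tags are made up for illustration. With convert_charrefs=True the parser decodes entities before calling handle_data, so the accented character is kept. This only extracts text; it is not a sanitizer for untrusted input.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text and drops tags (hypothetical helper name)."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &eacute; or &#233;
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called with already-decoded text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Asia, sp&eacute;cialiste du <b>voyage</b></p>"))
# prints: Asia, spécialiste du voyage
```

Note that the text inside script and style elements is also passed to handle_data, so it would show up in the result unless it is filtered separately.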

Any of the many regexps offered as answers to similar questions will always miss some edge cases that were not considered.

Is there any really good module I could use?

Arjuna Del Toso asked Apr 09 '13 00:04



1 Answer

bleach is excellent for this task. It does everything you need. It has an extensive test suite that checks for strange edge cases where tags could slip through. I have never had an issue with it.
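
A minimal sketch of how that might look, assuming the third-party bleach package is installed (pip install bleach). The helper name strip_html is made up for illustration. Passing an empty set of allowed tags together with strip=True asks bleach.clean to drop every tag rather than escape it.

```python
def strip_html(text):
    """Remove every HTML tag from `text` with bleach (hypothetical helper)."""
    # Imported here so the module still loads where bleach is not installed.
    import bleach

    # No allowed tags or attributes; strip=True removes disallowed tags
    # instead of escaping them.
    return bleach.clean(text, tags=set(), attributes={}, strip=True)


if __name__ == "__main__":
    print(strip_html("<b>hi</b> there"))  # prints: hi there
```

As far as I understand, the result is still meant to be embedded in HTML, so characters such as "<" and "&" come back escaped. If plain text is needed, the standard library's html.unescape can be applied to the result.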

Tim Heap answered Sep 25 '22 05:09