
HTML Entity Codes to Text [duplicate]

Does anyone know an easy way in Python to convert a string with HTML entity codes (e.g. &lt; &amp;) to a normal string (e.g. < &)?

cgi.escape() will escape strings (poorly), but there is no unescape().

tghw asked Mar 19 '09 17:03



2 Answers

HTMLParser in the standard library has this functionality, although its unescape() method is, unfortunately, undocumented:

(Python 2 Docs)

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

(Python 3 Docs)

Since Python 3.4 the documented html.unescape() function does this directly; HTMLParser.unescape() was deprecated in 3.4 and removed in 3.9.

>>> import html
>>> html.unescape('alpha &lt; &beta;')
'alpha < β'

htmlentitydefs (html.entities in Python 3) is documented, but requires you to do a lot of the work yourself.
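
To give a sense of what "doing the work yourself" might involve, here is a minimal Python 3 sketch built on the name2codepoint table from html.entities. The function name unescape_entities and the regular expression are illustrative assumptions, not part of the standard library; unknown entity names are left untouched, and out-of-range numeric references are not handled.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs.name2codepoint in Python 2

# Matches &name;, &#123; and &#x7b; style references (hypothetical pattern).
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(text):
    """Replace named and numeric character references in a single pass."""

    def _replace(match):
        body = match.group(1)
        if body.startswith(("#x", "#X")):
            # Hexadecimal numeric reference, e.g. &#x3e;
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            # Decimal numeric reference, e.g. &#60;
            return chr(int(body[1:]))
        codepoint = name2codepoint.get(body)
        # Leave unknown names exactly as they appeared in the input.
        return chr(codepoint) if codepoint is not None else match.group(0)

    return _ENTITY_RE.sub(_replace, text)
```

Because the substitution runs in one pass, a double-escaped string such as '&amp;lt;' is decoded only one level, to '&lt;'.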

If you only need the XML predefined entities (lt, gt, amp, quot, apos), you could use minidom to parse them. If you only need the predefined entities and no numeric character references, you could even just use a plain old string replace for speed.
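
The plain string replace mentioned above could look roughly like the sketch below. The function name unescape_xml_predefined is a hypothetical helper; the one detail that matters is replacing &amp; last, so that text like '&amp;lt;' is not decoded twice.

```python
# The five XML predefined entities; &amp; is deliberately last.
_XML_ENTITIES = (
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&amp;", "&"),
)


def unescape_xml_predefined(text):
    """Decode only the XML predefined entities; no numeric references."""
    for entity, char in _XML_ENTITIES:
        text = text.replace(entity, char)
    return text
```

Numeric character references and any other named entities pass through unchanged, which is the trade-off for the speed.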

bobince answered Sep 27 '22 19:09


I forgot to tag it at first, but I'm using BeautifulSoup.

Digging around in the documentation, I found:

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

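does it exactly as I was hoping. (This is the BeautifulSoup 3 API; Beautiful Soup 4 no longer recognizes the convertEntities argument and always converts entities to Unicode.)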

tghw answered Sep 27 '22 20:09