
HTMLParser.HTMLParser().unescape() doesn't work

I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°', etc.

I've read several posts regarding this question

Converting html source content into readable format with Python 2.x

Decode HTML entities in Python string?

Convert XML/HTML Entities into Unicode String in Python

and according to them, I chose to use the undocumented function unescape(), but it doesn't work for me...

My code sample is like:

import HTMLParser

htmlParser = HTMLParser.HTMLParser()
decoded = htmlParser.unescape('&copy; 2013')
print decoded

When I run this Python script, the output is still:

&copy; 2013

instead of

© 2013

I'm using Python 2.X on Windows 7 in a Cygwin console. I googled but didn't find any similar problems. Could anyone help me with this?

D.Q. asked Mar 24 '23 02:03


2 Answers

Apparently HTMLParser.unescape was a bit more primitive before Python 2.6.

Python 2.5:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'

Python 2.6/2.7:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'

UPDATE: Python 3.4+:

>>> import html
>>> html.unescape('&copy;')
'©'

See the 2.5 implementation vs the 2.6 implementation / 2.7 implementation
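
If upgrading from 2.5 is not an option, one possible workaround is to decode the references yourself with the standard entity table. The following is only a rough sketch, not the standard library's implementation: the helper name unescape_entities and the regular expression are made up for illustration, and it assumes references always end with a semicolon.

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

try:
    unichr  # Python 2 builtin
except NameError:
    unichr = chr  # Python 3: chr already returns a text character

# Matches &name;, &#123; and &#x7B; style references (semicolon required).
_ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def unescape_entities(text):
    """Hypothetical fallback: decode entity references, leave unknown ones as-is."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2] in ('#x', '#X'):
                return unichr(int(ref[2:], 16))    # hexadecimal reference
            if ref[0] == '#':
                return unichr(int(ref[1:]))        # decimal reference
            return unichr(name2codepoint[ref])     # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                  # unknown or out of range
    return _ENTITY_RE.sub(replace, text)


decoded = unescape_entities(u'&copy; 2013')
print(decoded == u'\xa9 2013')
```

Unknown names such as '&bogus;' are passed through unchanged rather than raising, which seemed the safer default for text of unknown origin.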

DrMeers answered Apr 20 '23 18:04


Starting in Python 3.9, calling HTMLParser().unescape(<str>) raises AttributeError: 'HTMLParser' object has no attribute 'unescape', because the deprecated method was removed.

Use the module-level function instead:

import html
html.unescape(<str>)
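
As a quick check on Python 3.4 or later, the snippet below runs html.unescape over the entities mentioned in the question, plus decimal and hexadecimal forms of the copyright sign. The sample list is made up for illustration, and ascii() is used only so the result prints safely on consoles that cannot display the characters.

```python
import html

# Illustrative inputs: named entities plus decimal and hex numeric references.
samples = ['&copy; 2013', '&pound;', '&deg;', '&#169;', '&#xA9;']

decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    # ascii() escapes non-ASCII characters, e.g. '\xa9 2013' for the first sample.
    print(raw, '->', ascii(text))
```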

andorov answered Apr 20 '23 19:04