Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Decoding HTML entities with Python

I'm trying to decode HTML entries from here NYTimes.com and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"

I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.

like image 319
KeyboardInterrupt Avatar asked Jul 30 '09 19:07

KeyboardInterrupt


People also ask

How do I decode a UTF 8 string in Python?

To decode a string encoded in UTF-8 format, we can use the decode() method specified on strings. This method accepts two arguments, encoding and error . encoding accepts the encoding of the string to be decoded, and error decides how to handle errors that arise during decoding.


2 Answers

>>> from HTMLParser import HTMLParser
>>> print HTMLParser().unescape('U.S. Adviser’s Blunt Memo on Iraq: '
...                             'Time ‘to Go Home’')
U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’

The function is undocumented in Python 2. It is fixed in Python 3.4+: it is exposed as html.unescape() there.

like image 101
jfs Avatar answered Sep 22 '22 13:09

jfs


Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example       all mean U+00A0 NO-BREAK SPACE.

  (the type you have) is a "numeric character reference" (decimal).
  is a "numeric character reference" (hexadecimal).
  is an entity.

Further reading: http://htmlhelp.com/reference/html40/entities/

Here you will find code for Python2.x that does all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html

like image 39
John Machin Avatar answered Sep 25 '22 13:09

John Machin