
Decode HTML entities in Python string?

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup("<p>&pound;682m</p>")
    >>> text = soup.find("p").string
    >>> print text
    &pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

jkp asked Jan 18 '10 16:01




2 Answers

Python 3.4+

Use html.unescape():

    import html
    print(html.unescape('&pound;682m'))

Note that html.parser.HTMLParser.unescape was deprecated in Python 3.4. It was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9.
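As a rough illustration of what html.unescape() covers, here is a minimal sketch for Python 3.4+. The sample strings are made up for this example and are not from the question. It decodes named, decimal, and hexadecimal character references, and shows that each call removes only one level of escaping:

```python
import html

# Named, decimal, and hexadecimal references for the pound sign (U+00A3).
samples = ['&pound;682m', '&#163;682m', '&#xA3;682m']
decoded = [html.unescape(s) for s in samples]
print(decoded)

# A double-escaped string needs two passes: the first call only turns
# '&amp;' back into '&', leaving a fresh '&pound;' entity behind.
double_escaped = '&amp;pound;682m'
once = html.unescape(double_escaped)
twice = html.unescape(once)
print(once)
print(twice)
```

If the input may have been escaped more than once, it is probably safer to find out why upstream than to call unescape in a loop.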


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

  • For Python 2.6-2.7 it's in HTMLParser
  • For Python 3 it's in html.parser

    >>> try:
    ...     # Python 2.6-2.7
    ...     from HTMLParser import HTMLParser
    ... except ImportError:
    ...     # Python 3
    ...     from html.parser import HTMLParser
    ...
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m

You can also use the six compatibility library to simplify the import:

    >>> from six.moves.html_parser import HTMLParser
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m
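If the same code has to run on both old and new interpreters, one option is to try the newest API first and fall back to the older ones. The following is only a sketch and is not taken from the answer; the module-level name `unescape` is an alias chosen for this example:

```python
try:
    # Python 3.4+: module-level function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3: method on HTMLParser in html.parser.
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7: method on HTMLParser in the HTMLParser module.
        from HTMLParser import HTMLParser
    # Bind the instance method so callers can use one name everywhere.
    unescape = HTMLParser().unescape

print(unescape('&pound;682m'))
```

Trying `html.unescape` first matters because the `HTMLParser.unescape` method no longer exists on recent Python 3 releases, so a fallback that only checks the module location would fail there.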
luc answered Oct 02 '22 16:10


Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3

    >>> from BeautifulSoup import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>",
    ...               convertEntities=BeautifulSoup.HTML_ENTITIES)
    <p>£682m</p>

Beautiful Soup 4

    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>")
    <html><body><p>£682m</p></body></html>
Ben James answered Oct 02 '22 15:10