I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
    >>> from BeautifulSoup import BeautifulSoup
    >>> soup = BeautifulSoup("<p>&pound;682m</p>")
    >>> text = soup.find("p").string
    >>> print text
    &pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?
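To decode a byte string encoded as UTF-8, use the decode() method defined on byte strings (bytes in Python 3, str in Python 2). It accepts two arguments, encoding and errors: encoding names the encoding of the bytes to be decoded, and errors selects how decoding failures are handled (for example "strict", "ignore", or "replace"). Note that this converts bytes to text; it does not decode HTML entities such as &pound;.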
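As a minimal sketch of what decode() does and does not do (Python 3 assumed; the variable names are illustrative only), the following decodes UTF-8 bytes, shows the effect of the errors argument, and confirms that an HTML entity passes through unchanged:

```python
# UTF-8 bytes for the text "£682m" (the pound sign encodes as 0xC2 0xA3).
raw = b"\xc2\xa3682m"

# Decode bytes to text; errors defaults to "strict".
text = raw.decode("utf-8")

# A deliberately truncated sequence: "replace" substitutes U+FFFD
# instead of raising UnicodeDecodeError.
recovered = b"\xc2".decode("utf-8", errors="replace")

# Byte decoding leaves HTML entities untouched.
still_escaped = b"&pound;682m".decode("utf-8")

print(text)
print(repr(recovered))
print(still_escaped)
```

So decode() solves a different problem from the one in the question: the entity text survives it intact.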
In Python 3.4+, use html.unescape():

    import html
    print(html.unescape('&pound;682m'))

FYI, html.parser.HTMLParser.unescape is deprecated. It was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9.
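For illustration, here is a small sketch showing that html.unescape() handles named, decimal, and hexadecimal character references alike. The sample strings are made up for this example and are not from the question:

```python
import html

# Named, decimal, and hexadecimal references for the pound sign,
# plus an ordinary ampersand entity.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m", "Tom &amp; Jerry"]

decoded = [html.unescape(s) for s in samples]

for original, result in zip(samples, decoded):
    print(original, "->", result)
```

The first three inputs all decode to the same string, "£682m".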
On Python versions before 3.4, you can use HTMLParser.unescape() from the standard library. The class lives in the HTMLParser module on Python 2.6-2.7 and in html.parser on Python 3:

    >>> try:
    ...     # Python 2.6-2.7
    ...     from HTMLParser import HTMLParser
    ... except ImportError:
    ...     # Python 3
    ...     from html.parser import HTMLParser
    ...
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m

You can also use the six compatibility library to simplify the import:
    >>> from six.moves.html_parser import HTMLParser
    >>> h = HTMLParser()
    >>> print(h.unescape('&pound;682m'))
    £682m
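If the same code has to run on both old and new interpreters, one possible approach is to prefer html.unescape and fall back to the HTMLParser method only when it is missing. This is a sketch, not a tested compatibility layer; decode_entities is a hypothetical helper name introduced here:

```python
try:
    # Python 3.4+: module-level function.
    from html import unescape
except ImportError:
    # Older interpreters: fall back to the HTMLParser method.
    try:
        from HTMLParser import HTMLParser  # Python 2.6-2.7
    except ImportError:
        from html.parser import HTMLParser  # Python 3.0-3.3
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Return text with HTML character references decoded (hypothetical helper)."""
    return unescape(text)


print(decode_entities("&pound;682m"))
```

Trying html.unescape first matters because HTMLParser still imports on Python 3.9+, where its unescape method no longer exists.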
Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.
    >>> # Beautiful Soup 3
    >>> from BeautifulSoup import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>",
    ...               convertEntities=BeautifulSoup.HTML_ENTITIES)
    <p>£682m</p>

    >>> # Beautiful Soup 4
    >>> from bs4 import BeautifulSoup
    >>> BeautifulSoup("<p>&pound;682m</p>")
    <html><body><p>£682m</p></body></html>