I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup >>> soup = BeautifulSoup("<p>£682m</p>") >>> text = soup.find("p").string >>> print text £682m
How can I decode the HTML entities in text
to get "£682m"
instead of "£682m"
.
To decode a string encoded in UTF-8 format, we can use the decode() method specified on strings. This method accepts two arguments, encoding and error . encoding accepts the encoding of the string to be decoded, and error decides how to handle errors that arise during decoding.
Use html.unescape()
:
import html print(html.unescape('£682m'))
FYI html.parser.HTMLParser.unescape
is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon.
You can use HTMLParser.unescape()
from the standard library:
HTMLParser
html.parser
>>> try: ... # Python 2.6-2.7 ... from HTMLParser import HTMLParser ... except ImportError: ... # Python 3 ... from html.parser import HTMLParser ... >>> h = HTMLParser() >>> print(h.unescape('£682m')) £682m
You can also use the six
compatibility library to simplify the import:
>>> from six.moves.html_parser import HTMLParser >>> h = HTMLParser() >>> print(h.unescape('£682m')) £682m
Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities
argument to the BeautifulSoup
constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.
>>> from BeautifulSoup import BeautifulSoup >>> BeautifulSoup("<p>£682m</p>", ... convertEntities=BeautifulSoup.HTML_ENTITIES) <p>£682m</p>
>>> from bs4 import BeautifulSoup >>> BeautifulSoup("<p>£682m</p>") <html><body><p>£682m</p></body></html>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With