I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode object?
For example, I get back:

    &#x01ce;

which represents an "ǎ" with a tone mark. As a 16-bit value, this is 01ce in hexadecimal. I want to convert the HTML entity into the value u'\u01ce'.
The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
Python 2 (on Python 3.0 to 3.3, the same class lives in html.parser):

    import HTMLParser
    h = HTMLParser.HTMLParser()
    h.unescape('&copy; 2010')  # u'\xa9 2010'
    h.unescape('&#169; 2010')  # u'\xa9 2010'

Python 3.4+:

    import html
    html.unescape('&copy; 2010')  # '\xa9 2010', i.e. '© 2010'
    html.unescape('&#169; 2010')  # '\xa9 2010', i.e. '© 2010'
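If the same scraper has to run on several Python versions, the two variants can be folded into one import. This is only a sketch: the fallback chain is my assumption about how you might want to wire it up, and it relies on nothing beyond the standard-library names already mentioned (html.unescape, html.parser.HTMLParser, and Python 2's HTMLParser module).

```python
try:
    # Python 3.4+: documented module-level function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0 to 3.3: the parser class lives in html.parser.
        from html.parser import HTMLParser
    except ImportError:
        # Python 2: the module itself is called HTMLParser.
        from HTMLParser import HTMLParser
    # Undocumented method; bind it once so callers can just use unescape().
    unescape = HTMLParser().unescape

# Named, decimal and hexadecimal references all go through the same call.
copyright_line = unescape('&copy; 2010')
tone_mark = unescape('&#x01ce;')
```

After this, the rest of the code can call unescape(text) without caring which branch supplied it.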
Python 2 has the htmlentitydefs module (renamed html.entities in Python 3), but this doesn't include a function to unescape HTML entities.
Python developer Fredrik Lundh (author of ElementTree, among other things) has such a function on his website, which works with decimal, hex and named entities:
    import re, htmlentitydefs

    ##
    # Removes HTML or XML character references and entities from a text string.
    #
    # @param text The HTML (or XML) source text.
    # @return The plain text, as a Unicode string, if necessary.

    def unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)
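The function above is Python 2 code (unichr, htmlentitydefs). For anyone on Python 3 who still wants this regex-based approach rather than html.unescape, a port might look roughly like the following. The name unescape_py3 is made up here to avoid clashing with the original, and also catching OverflowError is my own addition for absurdly large numeric references; otherwise the logic is meant to mirror the original.

```python
import re
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs


def unescape_py3(text):
    """Replace HTML/XML character references and named entities in text."""

    def fixup(match):
        ref = match.group(0)
        if ref[:2] == "&#":
            # Numeric character reference: hex (&#x...;) or decimal (&#...;).
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # malformed or out of range: leave as is
        # Named entity: look the name up in the standard table.
        try:
            return chr(name2codepoint[ref[1:-1]])
        except KeyError:
            return ref  # unknown name: leave as is

    return re.sub(r"&#?\w+;", fixup, text)


print(unescape_py3("&copy; 2010 &#x01ce; &nosuchentity;"))
```

Like the original, it leaves anything it cannot resolve untouched, so unknown names and malformed references pass through unchanged.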