Convert XML/HTML Entities into Unicode String in Python [duplicate]

Question

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example, I get back:

&#x01ce;

which represents an "ǎ" with a tone mark. Its code point is the 16-bit hexadecimal value 01ce. I want to convert the HTML entity into the value u'\u01ce'.

Vladislav · Accepted Answer

The standard library's very own HTMLParser has an undocumented method, unescape(), which does exactly what you think it does:

Python 2 (in Python 3.0 to 3.3 the class is html.parser.HTMLParser instead):

    import HTMLParser

    h = HTMLParser.HTMLParser()
    h.unescape('&copy; 2010')   # u'\xa9 2010'
    h.unescape('&#169; 2010')   # u'\xa9 2010'

Python 3.4+:

    import html

    html.unescape('&copy; 2010')   # '© 2010'
    html.unescape('&#169; 2010')   # '© 2010'
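To tie this back to the question, here is a small sketch that runs html.unescape() (Python 3.4+) on the character reference from the question. The sample strings and the names samples and decoded are made up for illustration; they are not part of the original answer.

```python
import html

# Illustrative inputs: the reference from the question in hex form,
# the same code point in decimal form, and a string mixing text with entities.
samples = ["&#x01ce;", "&#462;", "n&#x01ce;o &amp; &copy; 2010"]

# html.unescape() handles hex, decimal and named references alike.
decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(repr(raw), "->", repr(text))

# Both numeric forms decode to the single character U+01CE.
print(decoded[0] == "\u01ce")  # True
```

In Python 3 every str is already Unicode, so the result is the equivalent of the u'\u01ce' value asked for in the question.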

dF. · Answer

Python has the htmlentitydefs module (renamed html.entities in Python 3), but it doesn't include a function to unescape HTML entities.
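As a rough illustration of what the module does offer, the sketch below looks up one entity name in its mapping tables. It assumes Python 3, where the module is called html.entities (htmlentitydefs in Python 2); the variable names are hypothetical.

```python
from html.entities import name2codepoint, codepoint2name

# name2codepoint maps an entity name (without "&" and ";") to its code point.
copy_cp = name2codepoint["copy"]

# Turning the code point into a character is a separate, manual step.
copy_char = chr(copy_cp)

# codepoint2name is the reverse mapping.
round_trip = codepoint2name[copy_cp]

print(copy_cp, repr(copy_char), round_trip)
```

The tables only translate names to code points, so finding the references in a string and substituting them still has to be done separately.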

Python developer Fredrik Lundh (author of ElementTree, among other things) has such a function on his website, which works with decimal, hex and named entities:

    import re, htmlentitydefs

    ##
    # Removes HTML or XML character references and entities from a text string.
    #
    # @param text The HTML (or XML) source text.
    # @return The plain text, as a Unicode string, if necessary.

    def unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)
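The function above is Python 2 code (it relies on unichr and htmlentitydefs). A possible Python 3 port might look like the sketch below. It is an adaptation, not Fredrik Lundh's original: it assumes html.entities.name2codepoint and chr as replacements, renames the inner variable to ref, and additionally catches OverflowError for absurdly large numbers.

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace HTML/XML character references and named entities in text."""

    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # Numeric character reference, hex (&#x...;) or decimal (&#...;).
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass  # not a valid number or code point
        else:
            # Named entity such as &copy;
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass  # unknown name
        return ref  # leave anything unrecognised as is

    return re.sub(r"&#?\w+;", fixup, text)


print(repr(unescape("&#x01ce; &#462; &copy; &bogus;")))
```

On Python 3.4 and later, html.unescape() covers the same cases, so a hand-written version like this is mainly useful for seeing how the substitution works.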

Tags: python, html, entities

Cristian
