
Convert XML/HTML Entities into Unicode String in Python [duplicate]

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example, I get back:

    &#x01ce;

which represents an "ǎ" with a tone mark. Its code point is the 16-bit hexadecimal value 01ce. I want to convert the HTML entity into the value u'\u01ce'.

Cristian asked Sep 11 '08 21:09



2 Answers

The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

Python 2 (in Python 3.0 to 3.3 the class is html.parser.HTMLParser instead):

    import HTMLParser
    h = HTMLParser.HTMLParser()
    h.unescape('&copy; 2010')  # u'\xa9 2010'
    h.unescape('&#169; 2010')  # u'\xa9 2010'

Python 3.4+:

    import html
    html.unescape('&copy; 2010')  # u'\xa9 2010'
    html.unescape('&#169; 2010')  # u'\xa9 2010'
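Applied to the case in the question, a minimal sketch for Python 3.4+ might look like this. The variable names and the sample input are illustrative assumptions, built from the hexadecimal reference the question describes plus its decimal equivalent.

```python
import html

# Hypothetical scraped text: a hexadecimal reference, the same character as a
# decimal reference (0x01ce == 462), and a named entity.
scraped = "&#x01ce; &#462; &copy; 2010"

decoded = html.unescape(scraped)

# Both numeric forms decode to the single character U+01CE.
print(decoded[0] == "\u01ce")
print(decoded == "\u01ce \u01ce \xa9 2010")
```

In Python 3 the result is an ordinary str, so no separate unicode type is involved.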
Vladislav answered Sep 20 '22 00:09



Python has the htmlentitydefs module (html.entities in Python 3), but it doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:

    import re, htmlentitydefs

    ##
    # Removes HTML or XML character references and entities from a text string.
    #
    # @param text The HTML (or XML) source text.
    # @return The plain text, as a Unicode string, if necessary.

    def unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)
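The function above targets Python 2 (unichr, htmlentitydefs). A rough Python 3 port, offered as a sketch rather than as Lundh's own code, could swap in chr and html.entities.name2codepoint while keeping the same regular expression and the same "leave as is" fallback. The extra OverflowError handling is an addition of this sketch.

```python
import re
from html.entities import name2codepoint


def unescape(text):
    """Replace character references and named entities in text (Python 3 sketch)."""

    def fixup(match):
        ref = match.group(0)
        if ref[:2] == "&#":
            # Numeric character reference: hexadecimal (&#x...;) or decimal (&#...;).
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # malformed or out of range: leave as is
        # Named entity such as &copy;
        try:
            return chr(name2codepoint[ref[1:-1]])
        except KeyError:
            return ref  # unknown name: leave as is

    return re.sub(r"&#?\w+;", fixup, text)


print(unescape("&#x01ce;") == "\u01ce")
print(unescape("&copy; 2010") == "\xa9 2010")
```

On Python 3.4+ the built-in html.unescape covers the same ground, so a hand-written version like this is mainly useful where custom fallback behaviour is wanted.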
dF. answered Sep 22 '22 00:09
