Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert numeric character reference notation to unicode string

Is there a standard, preferably Pythonic, way to convert the &#xxxx; notation to a proper unicode string?

For example,

מפגשי

Should be converted to:

מפגשי

It can be done - quite easily - using string manipulations, but I wonder if there's a standard library for this.

like image 599
Adam Matan Avatar asked Jun 10 '13 07:06

Adam Matan


1 Answers

Use HTMLParser.HTMLParser():

>>> from HTMLParser import HTMLParser
>>> h = HTMLParser()
>>> s = "מפגשי"
>>> print h.unescape(s)
מפגשי

It's part of the standard library, too.


However, if you're using Python 3, you have to import from html.parser:

>>> from html.parser import HTMLParser
>>> h = HTMLParser()
>>> s = 'מפגשי'
>>> print(h.unescape(s))
מפגשי
like image 157
TerryA Avatar answered Sep 20 '22 14:09

TerryA