Decoding ampersand-hash strings (&#124&#120&#97, etc.)

Question

The solutions in other answers do not work for me: each method returns the same string unchanged.

I am trying to scrape a web page using Python 2.7. The downloaded page contains characters in the form &#120, where 120 appears to be the ASCII code. I tried HTMLParser() and the decode() method, but neither works. Please note that the text I have from the web page consists only of characters in this format. Example:

&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32

How can I decode these strings using Python? I have read the other answers, but the solutions don't seem to work for me.

Fabich · Accepted Answer

The correct format for a numeric character reference is &#nnnn; so the trailing ; is missing in your example. You can add the ; with a regular expression and then use HTMLParser.unescape():

from HTMLParser import HTMLParser  # Python 2 module
import re

x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Append the missing ';' after each numeric reference (at least one digit)
x = re.sub(r'(&#[0-9]+)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)

This gives the following output:

&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;
Blasterjaxx
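
If the scraped text might mix references that already end in ; with ones that do not, the substitution above would double up the semicolons. The following is a minimal sketch of one way around that, written for Python 3 (where html.unescape replaces HTMLParser.unescape); the helper name add_missing_semicolons and the sample string are made up for illustration.

```python
import html
import re

def add_missing_semicolons(text):
    # Append ';' to each decimal reference that lacks one. The negative
    # lookahead stops the match from ending mid-number or just before
    # an existing ';', so well-formed references are left alone.
    return re.sub(r'(&#[0-9]+)(?![0-9;])', r'\1;', text)

raw = '&#66&#108;&#97&#115'   # hypothetical mixed input
fixed = add_missing_semicolons(raw)
print(fixed)                 # &#66;&#108;&#97;&#115;
print(html.unescape(fixed))  # Blas
```

The lookahead matters because the regex engine would otherwise backtrack and match a shorter run of digits in front of an existing semicolon.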

PM 2Ring · Answer

Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.

However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. For example:

s ='&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print ''.join([chr(int(u)) for u in s.split('&#') if u])

output

Blasterjaxx

The if u skips over the initial empty string that we get because s begins with the splitting string '&#'. Alternatively, we could skip it by slicing:

''.join([chr(int(u)) for u in s.split('&#')[1:]])
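
The split-based approach assumes the string contains nothing but references; any ordinary text between them would make int() fail. As a rough sketch of a variant that tolerates surrounding text, a regular expression can replace each reference in place. This is written for Python 3, where chr accepts any Unicode code point (in Python 2, unichr would play that role); the function name decode_decimal_refs and the sample input are assumptions for illustration.

```python
import re

def decode_decimal_refs(text):
    # Replace every '&#NNN' (with or without a trailing ';') by the
    # character with that code point; all other text passes through.
    return re.sub(r'&#([0-9]+);?', lambda m: chr(int(m.group(1))), text)

sample = 'Artist: &#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32(live)'
print(decode_decimal_refs(sample))  # Artist: Blasterjaxx (live)
```

Note that chr raises ValueError for numbers above 0x10FFFF, so untrusted input may need a range check.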

frnhr · Answer

In Python 3, use the html module:

>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '

docs: https://docs.python.org/3/library/html.html
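
As a quick check that html.unescape is not limited to the exact form in the question, the sketch below feeds it a mix of decimal, hexadecimal, and named references, some without the trailing semicolon. The sample string is made up for illustration.

```python
import html

# Decimal without ';', hexadecimal with ';', and a named reference
mixed = '&#66&#x6c;&#97 &amp; &#115'
decoded = html.unescape(mixed)
print(decoded)  # Bla & s
```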
