Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Decoding ampersand hash strings (&#124&#120&#97)etc

The solutions in other answers do not work when I try them, the same string outputs when I try those methods.

I am trying to do web scraping using Python 2.7. I have the webpage downloaded and it has some characters which are in the form &#120 where 120 seems to represent the ascii code. I tried using HTMLParser() and decode() methods but nothing seems to work. Please note that what I have from the webpage in the format are only those characters. Example:

&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32

Please guide me to decode these strings using Python. I have read the other answers but the solutions don't seem to work for me.

like image 662
Ivankovich Avatar asked Jul 20 '16 11:07

Ivankovich


3 Answers

The correct format for character reference is &#nnnn; so the ; is missing in your example. You can add the ; and then use HTMLParser.unescape() :

from HTMLParser import HTMLParser
import re
x ='&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
x = re.sub(r'(&#[0-9]*)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)

This gives this output :

Blasterjaxx 
Blasterjaxx 
like image 81
Fabich Avatar answered Sep 18 '22 01:09

Fabich


Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.

However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. Eg,

s ='&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print ''.join([chr(int(u)) for u in s.split('&#') if u])

output

Blasterjaxx 

The if u skips over the initial empty string that we get because s begins with the splitting string '&#'. Alternatively, we could skip it by slicing:

''.join([chr(int(u)) for u in s.split('&#')[1:]])
like image 34
PM 2Ring Avatar answered Sep 18 '22 01:09

PM 2Ring


In Python 3, use the html module:

>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '

docs: https://docs.python.org/3/library/html.html

like image 31
frnhr Avatar answered Sep 19 '22 01:09

frnhr