The solutions in other answers do not work when I try them; those methods just output the same string unchanged.
I am trying to do web scraping using Python 2.7. I have downloaded the webpage, and it contains characters in the form &#120, where 120 seems to be the ASCII code. I tried the HTMLParser() and decode() methods, but nothing seems to work.
Please note that the text I have from the webpage consists only of characters in that format.
Example:
&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32
Please guide me to decode these strings using Python. I have read the other answers but the solutions don't seem to work for me.
The correct format for a numeric character reference is &#nnnn; so the ; is missing in your example. You can add the ; and then use HTMLParser.unescape():
from HTMLParser import HTMLParser
import re
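x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
x = re.sub(r'(&#[0-9]*)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)
This gives the following output:
&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;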
Blasterjaxx
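One caveat with the substitution above: if some references in the page already end in ;, the pattern appends a second one, and the extra ; survives unescaping. A minimal sketch of a more defensive substitution follows. The helper name add_missing_semicolons is made up for illustration, and the code uses only the re module, so it should behave the same under Python 2.7 and Python 3.

```python
import re

# Match "&#" plus digits only when the digits are NOT already followed by ";".
# The lookahead also rejects a following digit, so the regex cannot backtrack
# and stop in the middle of a number.
MISSING_SEMICOLON = re.compile(r'(&#[0-9]+)(?![0-9;])')


def add_missing_semicolons(text):
    """Append ';' to numeric character references that lack one (hypothetical helper)."""
    return MISSING_SEMICOLON.sub(r'\1;', text)


# The first and last references lack ';', the middle one already has it.
print(add_missing_semicolons('&#66&#108;&#97'))  # &#66;&#108;&#97;
```

The result can then be passed to HTMLParser().unescape() exactly as in the answer.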
Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.
However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. For example:
s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print ''.join([chr(int(u)) for u in s.split('&#') if u])
Output:
Blasterjaxx
The if u skips over the initial empty string that we get because s begins with the splitting string '&#'. Alternatively, we could skip it by slicing:
''.join([chr(int(u)) for u in s.split('&#')[1:]])
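The split-based approach assumes the string contains nothing but references; any ordinary text between them would make int() raise ValueError. If the scraped data mixes plain text and references, a regex with a replacement callback is one way around that. This is only a sketch, written for Python 3 (under Python 2.7 you would likely want unichr instead of chr), and decode_numeric_refs is an illustrative name, not a library function.

```python
import re

# "&#", one or more decimal digits, and an optional trailing ";".
NUMERIC_REF = re.compile(r'&#([0-9]+);?')


def decode_numeric_refs(text):
    """Replace each decimal character reference with its character (hypothetical helper).

    Text that is not a reference is left untouched.
    """
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


s = 'Artist: &#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32(live)'
print(decode_numeric_refs(s))  # Artist: Blasterjaxx (live)
```

Note that without the ; terminator a reference immediately followed by a literal digit is ambiguous, so this sketch treats every digit after &# as part of the code.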
In Python 3, use the html module:
>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '
docs: https://docs.python.org/3/library/html.html
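No regex step is needed here because html.unescape() follows the HTML 5 parsing rules, which tolerate a missing ; after a numeric reference. A small sketch to illustrate; the sample string is invented for the demonstration and mixes references with and without the terminator.

```python
import html

# Decimal without ';', decimal with ';', a named reference, and hex without ';'.
mixed = '&#66&#108;&amp;&#x41'
decoded = html.unescape(mixed)
print(decoded)  # Bl&A
```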