The solutions in other answers do not work for me: when I try those methods, the same string comes back unchanged.
I am trying to do web scraping using Python 2.7. I have downloaded the webpage, and it contains characters in the form &#120, where 120 seems to be the ASCII code. I tried the HTMLParser() and decode() methods, but nothing seems to work.
Please note that the text I get from the webpage consists only of characters in this format.
Example:
&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32
Please guide me to decode these strings using Python. I have read the other answers but the solutions don't seem to work for me.
The correct format for a character reference is &#nnnn; so the ; is missing in your example. You can add the ; and then use HTMLParser.unescape():
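from HTMLParser import HTMLParser  # Python 2.7 module
import re
x = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
# Append the missing ; to every numeric character reference
x = re.sub(r'(&#[0-9]+)', r'\1;', x)
print x
h = HTMLParser()
print h.unescape(x)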
This gives the following output:
&#66;&#108;&#97;&#115;&#116;&#101;&#114;&#106;&#97;&#120;&#120;&#32;
Blasterjaxx
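One caveat with the substitution above: the pattern also matches references that already end in ; so input that mixes both forms would end up with a doubled ;. The following is a minimal sketch of a more defensive variant, not a definitive implementation. The helper name add_missing_semicolons is hypothetical, and the import fallback assumes either Python 3.4+ (html.unescape) or Python 2.7 (HTMLParser).

```python
import re

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape

# A decimal reference that is NOT followed by another digit or by ';'.
# The lookahead keeps well-formed references such as '&#108;' untouched.
_UNTERMINATED = re.compile(r'(&#[0-9]+)(?![0-9;])')


def add_missing_semicolons(text):
    """Append ';' only to numeric references that lack one (hypothetical helper)."""
    return _UNTERMINATED.sub(r'\1;', text)


mixed = '&#66&#108;&#97'  # the second reference is already terminated
fixed = add_missing_semicolons(mixed)
print(fixed)
print(unescape(fixed))
```

The negative lookahead is what prevents the doubled ; here: the regex engine cannot shorten the digit run to sneak a match in, because the next character would then be a digit.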
Depending on what you're doing, you may wish to convert that data to valid HTML character references so you can parse it in context with a proper HTML parser.
However, it's easy enough to extract the number strings and convert them to the equivalent ASCII characters yourself. E.g.,
s = '&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32'
print ''.join([chr(int(u)) for u in s.split('&#') if u])
output
Blasterjaxx
The if u skips over the initial empty string that we get because s begins with the splitting string '&#'. Alternatively, we could skip it by slicing:
''.join([chr(int(u)) for u in s.split('&#')[1:]])
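The split-based approach assumes the string contains nothing but numeric references; any ordinary text between them would make int() raise ValueError. A possible sketch for mixed input uses re.sub with a callback instead. The name decode_numeric_refs is hypothetical, and the code assumes Python 3, where chr() accepts any Unicode code point (in Python 2.7, unichr() would be needed for values above 255).

```python
import re

# Decimal character reference, with or without the trailing ';'
_NUMERIC_REF = re.compile(r'&#([0-9]+);?')


def decode_numeric_refs(text):
    """Replace references such as &#120 or &#120; with the character they name.

    Hypothetical helper; text outside the references is left as it is.
    """
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


print(decode_numeric_refs('Artist: &#66&#108&#97;!'))
```

Note that chr() still raises ValueError for numbers outside the Unicode range, so untrusted input may need a try/except around the conversion.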
In Python 3, use the html module:
>>> import html
>>> html.unescape('&#66&#108&#97&#115&#116&#101&#114&#106&#97&#120&#120&#32')
'Blasterjaxx '
docs: https://docs.python.org/3/library/html.html
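As a quick check of why no semicolon fix-up is needed in Python 3, the short sketch below compares the terminated and unterminated forms of the same references. The variable names are illustrative only, and the sample strings are shortened stand-ins rather than data from the question.

```python
import html

with_semicolons = '&#66;&#108;&#97;'
without_semicolons = '&#66&#108&#97'

# html.unescape accepts decimal references with or without the final ';'
decoded_a = html.unescape(with_semicolons)
decoded_b = html.unescape(without_semicolons)

print(decoded_a == decoded_b)
```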