How do I decode strings such as this one "weren\xe2\x80\x99t" back to the normal encoding.
So this word is actually weren't and not "weren\xe2\x80\x99t"? For example:
print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')
“Things
“Things
Things
But I actually want to get "Things.
or:
print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')
weren’t
weren’t
werent
But I actually want to get weren't.
How should i do this?
The Python "UnicodeDecodeError: 'ascii' codec can't decode byte in position" occurs when we use the ascii codec to decode bytes that were encoded using a different codec. To solve the error, specify the correct encoding, e.g. utf-8 .
Python bytes decode() function is used to convert bytes to string object. Both these functions allow us to specify the error handling scheme to use for encoding/decoding errors. The default is 'strict' meaning that encoding errors raise a UnicodeEncodeError.
decode() is a method specified in Strings in Python 2. This method is used to convert from one encoding scheme, in which argument string is encoded to the desired encoding scheme. This works opposite to the encode. It accepts the encoding of the encoding string to decode it and returns the original string.
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
I mapped the most common strange chars so this is pretty much complete answer based on the Oliver W. answer.
This function is by no means ideal,but it is the best place to start with. There are more chars definitions:
http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal
...
def unicodetoascii(text):
uni2ascii = {
ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
ord('\xc3\xa9'.decode('utf-8')): ord('e'),
ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),
ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),
ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),
ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
}
return text.decode('utf-8').translate(uni2ascii).encode('ascii')
print unicodetoascii("weren\xe2\x80\x99t")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With