I have a string (originally taken from a search engine's results) that contains special characters such as '\xe9', and I just want to replace those characters with normal characters so that I can print them (it's a Python program).
So how do I do it? It keeps giving me this error: " File "D:\Python27\lib\encodings\cp1255.py", line 12, in encode return codecs.charmap_encode(input,errors,encoding_table) UnicodeEncodeError: 'charmap' codec can't encode character u'\xe9' in position 11: character maps to <undefined>"
By the way, when I print "sys.getdefaultencoding()" it prints : Cp1255
The error originally happens in this function call: "urllib.urlencode(THE STRING)" but it also happens when I try to write "print (firstSearch['Results'][i]['Title'])" where firstSearch is a JSON that I built from the search results of the search engine...
tnx, Itamar.
Use the codecs module to transform a given string into an encoding that you can further use (e.g. print, or pass to another function). The safest encoding for arbitrary purposes is of course ASCII, but it's also the one with the most loss.
E.g.
import codecs

s = "\xe9 and other stuff"
s1 = codecs.encode(codecs.decode(s, '<source-encoding>', 'replace'), 'utf-8')
This decodes your source string into a unicode string from the encoding it is in (you need to check which encoding the search engine returns). The 'replace' argument substitutes a placeholder for anything that cannot be converted: U+FFFD when decoding, '?' when encoding. Either way that is a loss of information. There are other options as well; check the docs.
The result is then encoded into the target encoding, here for example utf-8, which is OK if, e.g., you want to print the string on a terminal that supports this encoding. If you want to process the result string further, I would recommend sticking with Unicode for as long as possible.
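To make the decode/encode round trip and the error-handler options concrete, here is a minimal sketch. It assumes the source bytes are Latin-1, which is only a guess for illustration; the real encoding depends on what the search engine sends. The variable names are made up for the example.

```python
import codecs

# Byte string as it might arrive from the network (assumed Latin-1 here).
raw = b"\xe9 and other stuff"

# Step 1: decode bytes into a unicode string using the source encoding.
text = codecs.decode(raw, 'latin-1', 'replace')

# Step 2: encode the unicode string into whatever the consumer needs.
as_utf8 = codecs.encode(text, 'utf-8')                      # lossless
as_ascii_replace = codecs.encode(text, 'ascii', 'replace')  # e-acute becomes '?'
as_ascii_ignore = codecs.encode(text, 'ascii', 'ignore')    # e-acute is dropped
as_ascii_xml = codecs.encode(text, 'ascii', 'xmlcharrefreplace')  # '&#233;'

# Decoding with the wrong source encoding is where U+FFFD shows up.
wrong_guess = codecs.decode(raw, 'ascii', 'replace')
```

Only the UTF-8 result keeps the original character; the ASCII variants show how each error handler trades information for safety.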
Two things to note here:
NB: The .encode and .decode functions are also available as string methods, so you can write s.decode(...) etc.
It appears that you are on a Windows machine, in a Hebrew locale, with the default encoding being cp1255 which uses the hi-bit-set characters to support the Hebrew script, not Western European characters like u'\xe9' which is LATIN SMALL LETTER E WITH ACUTE.
You should be able to do
print u'\xe9'
in IDLE and observe e-acute being printed.
Note: str(some_unicode_string) is only of practical use (i.e. supports ALL Unicode characters) if the default encoding is UTF-something (usually UTF-8) or GB18030. On Windows machines, it's usually ascii. Yours is 'cp1255', which is not OK for arbitrary Unicode characters.
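The cp1255 limitation can be checked directly. The following is a small sketch, not a fix for every case: it shows that cp1255 encodes a Hebrew letter but rejects e-acute, and it includes a hypothetical helper, to_printable, that swaps unencodable characters for '?' so that the text can at least be printed on such a console.

```python
text = u'\xe9'  # LATIN SMALL LETTER E WITH ACUTE

# cp1255 has no slot for e-acute, so a strict encode raises.
try:
    text.encode('cp1255')
    can_encode = True
except UnicodeEncodeError:
    can_encode = False

# Hebrew letters are what cp1255 uses the high-bit bytes for.
alef_bytes = u'\u05d0'.encode('cp1255')

def to_printable(value, encoding):
    """Return value with characters the encoding cannot represent
    replaced by '?' (hypothetical helper, lossy by design)."""
    return value.encode(encoding, 'replace').decode(encoding)

safe_title = to_printable(u'caf\xe9', 'cp1255')
```

Something like print(to_printable(title, sys.stdout.encoding)) would then avoid the UnicodeEncodeError, at the cost of showing '?' for the characters the console cannot display.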
Update after new information provided in comments:
For your urllib.urlencode() problem: That function expects a str object. You are supplying a unicode object. Python 2.x attempts to encode using the system default encoding (cp1255 in your case). cp1255 doesn't handle u'\xe9', hence the error message. You need to ascertain what encoding is expected by the website with which you are communicating. With luck, it's UTF-8. Instead of passing the_unicode_string, pass the_unicode_string.encode(website_expected_encoding). If the expected encoding is cp1255 or some other encoding that doesn't support all the unicode characters that are returned by your queries (on a different site? same site???) then you are seriously out of luck and/or you need to examine carefully how you got those unicode strings in the first place. See this answer by @bobince ... ignore the accepted answer which is much less informative.
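A sketch of the encode-before-urlencode advice follows. It assumes the website expects UTF-8, which you would need to confirm. build_query is a hypothetical helper name, and it assumes every value is a text or byte string. The import fallback is there only so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

def build_query(params, encoding='utf-8'):
    """Encode unicode values to bytes explicitly, then URL-encode them.

    Hypothetical helper: 'encoding' should be whatever the target
    website expects (UTF-8 is assumed here, not confirmed).
    """
    encoded = {}
    for key, value in params.items():
        if not isinstance(value, bytes):
            # Explicit encode, so the system default encoding is never used.
            value = value.encode(encoding)
        encoded[key] = value
    return urlencode(encoded)

query = build_query({'q': u'caf\xe9'})
```

Because the encoding is chosen explicitly, cp1255 never enters the picture; if the site expected a different encoding, only the encoding argument would change.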