
How to translate an unprintable string to a normal string in Python?

I have a string (originally taken from a search engine's results) that contains special characters such as '\xe9'. I just want to replace those characters with normal characters so that I can print them (it's a Python program).

So how do I do it? I keep getting this error:

  File "D:\Python27\lib\encodings\cp1255.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xe9' in position 11: character maps to <undefined>

By the way, when I print sys.getdefaultencoding() it prints: Cp1255

The error originally happens in this function call: "urllib.urlencode(THE STRING)", but it also happens when I try to write "print (firstSearch['Results'][i]['Title'])", where firstSearch is a JSON object that I built from the search results of the search engine...

tnx, Itamar.

asked May 02 '26 21:05 by Itamar


2 Answers

Use the codecs module to transform a given string into an encoding that you can further use (e.g. print, or pass to another function). The safest encoding for arbitrary purposes is of course ASCII, but it's also the one with the most loss.

E.g.

import codecs
# Decode the raw bytes from their source encoding, then re-encode as UTF-8.
s = "\xe9 and other stuff"
s1 = codecs.encode(codecs.decode(s, '<source-encoding>', 'replace'), 'utf-8')

This will decode your source string from the encoding it is in into a unicode string (you need to check which encoding the search engine returns). The 'replace' argument substitutes a replacement character for anything that cannot be converted (U+FFFD when decoding, '?' when encoding), which is a loss of information. There are other options as well; check the docs.

The result is then encoded into the target encoding, here for example utf-8, which is OK if, for instance, you want to print the string on a terminal that supports this encoding. If you want to process the result string further, I recommend sticking with Unicode as long as possible.
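As a rough illustration of the decode-then-encode round trip described above, here is a sketch written for Python 3 (the question uses Python 2.7, where the same codecs calls exist but the byte type is str and the text type is unicode). The latin-1 source encoding is only an assumption for the example; substitute whatever the search engine actually returns.

```python
import codecs

# Raw bytes as they might arrive from the network.
# latin-1 is an ASSUMED source encoding, chosen only for illustration.
raw = b"\xe9 and other stuff"

# Step 1: bytes -> text. With 'replace', undecodable bytes become U+FFFD
# instead of raising UnicodeDecodeError.
text = codecs.decode(raw, "latin-1", "replace")

# Step 2: text -> bytes in the target encoding.
utf8_bytes = codecs.encode(text, "utf-8")

# Decoding with the wrong codec loses information: a lone 0xE9 byte is not
# valid UTF-8, so 'replace' turns it into U+FFFD.
lossy = codecs.decode(raw, "utf-8", "replace")
```

The last line shows why the source encoding has to be known rather than guessed: the call succeeds, but the e-acute is gone.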

Two things to note here:

  • You need to know what your input string's encoding is.
  • You need to know what encoding the target function can handle. This might be different for 'print' (ascii?) and 'urllib.urlencode' (unicode?).
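To make the second point concrete, the sketch below (Python 3; encode_for_target is a hypothetical helper, not part of any library) shows what the standard error handlers do when the target encoding cannot represent every character. ASCII is used as the restrictive target purely as an example.

```python
def encode_for_target(text, encoding, errors="strict"):
    """Encode text for a consumer that only accepts `encoding`.

    Hypothetical helper for illustration; it is a thin wrapper over str.encode.
    """
    return text.encode(encoding, errors)

title = "caf\xe9"  # the text 'cafe' with an e-acute

# A target that supports the character: nothing is lost.
as_utf8 = encode_for_target(title, "utf-8")

# A target that does not: pick how the loss should look.
as_ascii_q = encode_for_target(title, "ascii", "replace")            # '?'
as_ascii_x = encode_for_target(title, "ascii", "xmlcharrefreplace")  # '&#233;'
as_ascii_b = encode_for_target(title, "ascii", "backslashreplace")   # '\xe9' spelled out

# The default handler, 'strict', raises instead of losing data silently.
try:
    encode_for_target(title, "ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Which handler is appropriate depends on the consumer: 'replace' suits a console that cannot show the character, while 'xmlcharrefreplace' only makes sense if the output ends up in HTML or XML.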

NB: The .encode and .decode functions are also available as string methods, so you can write s.decode(...) etc.

answered May 05 '26 11:05 by ThomasH


It appears that you are on a Windows machine, in a Hebrew locale, with the default encoding being cp1255. That encoding uses the high-bit-set byte values for the Hebrew script, not for Western European characters like u'\xe9', which is LATIN SMALL LETTER E WITH ACUTE.

You should be able to do

print u'\xe9'

in IDLE and observe e-acute being printed.
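The cp1255 diagnosis above can be checked directly. This is a small sketch in Python 3 syntax (the question itself runs on Python 2.7); it uses only the standard unicodedata module and the built-in cp1255 codec.

```python
import unicodedata

e_acute = "\xe9"
char_name = unicodedata.name(e_acute)

# cp1255 has no slot for U+00E9, so encoding it fails, which is the
# 'character maps to <undefined>' error from the question.
try:
    e_acute.encode("cp1255")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Going the other way, the byte 0xE9 in cp1255 stands for a Hebrew letter.
hebrew_char = b"\xe9".decode("cp1255")
hebrew_name = unicodedata.name(hebrew_char)
```

So the same numeric value 0xE9 means e-acute as a Unicode code point but Hebrew yod as a cp1255 byte, which is why the two cannot be converted into each other.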

Note: str(some_unicode_string) is only of practical use (i.e. supports ALL Unicode characters) if the default encoding is UTF-something (usually UTF-8) or GB18030. On Windows machines, it's usually ascii. Yours is 'cp1255', which is not OK for arbitrary Unicode characters.
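Python 3 no longer performs this implicit conversion, but the underlying property can still be probed: some codecs can represent arbitrary characters and others cannot. The sketch below (Python 3; can_encode is a hypothetical helper, and the three sample characters are an arbitrary choice that proves nothing about ALL of Unicode) tries a few codecs mentioned in this thread.

```python
# e-acute, Hebrew alef, and one CJK ideograph: an arbitrary small sample.
sample = "\xe9\u05d0\u4e2d"

def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

results = {
    name: can_encode(sample, name)
    for name in ("utf-8", "gb18030", "ascii", "cp1255")
}
```

With this sample, the two full-coverage encodings succeed while ascii and cp1255 do not, since cp1255 covers the alef but not the other two characters.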

Update after new information provided in comments:

For your urllib.urlencode() problem: That function expects a str object. You are supplying a unicode object. Python 2.x attempts to encode using the system default encoding (cp1255 in your case). cp1255 doesn't handle u'\xe9', hence the error message.

You need to ascertain what encoding is expected by the website with which you are communicating. With luck, it's UTF-8. Instead of passing the_unicode_string, pass the_unicode_string.encode(website_expected_encoding).

If the expected encoding is cp1255 or some other encoding that doesn't support all the unicode characters that are returned by your queries (on a different site? same site???) then you are seriously out of luck and/or you need to examine carefully how you got those unicode strings in the first place. See this answer by @bobince ... ignore the accepted answer which is much less informative.
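The advice to encode explicitly before calling urlencode can be sketched as follows. This is Python 3, where the function lives in urllib.parse rather than urllib; the parameter name q and the choice of UTF-8 as the website's expected encoding are assumptions made only for the example.

```python
from urllib.parse import urlencode

query_text = "caf\xe9"  # text containing e-acute

# Encode explicitly with the encoding the website expects (UTF-8 ASSUMED here).
# 'q' is a hypothetical parameter name.
utf8_query = urlencode({"q": query_text.encode("utf-8")})

# A site expecting Latin-1 would need different bytes for the same text.
latin1_query = urlencode({"q": query_text.encode("latin-1")})

# Python 3's urlencode can also do the encoding itself via its
# `encoding` parameter, which gives the same result as the first call.
same_utf8_query = urlencode({"q": query_text}, encoding="utf-8")
```

The two encodings produce different percent-escapes for the same character, so sending the wrong one silently changes what the server receives.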

answered May 05 '26 09:05 by John Machin