I've just run into a few Unicode errors with an application I'm running that every now and again has to deal with really odd strings, most recently:
Pınar Karsıyaka
In my dev environment (Aptana with PyDev on a Mavericks Mac with an up-to-date Homebrew Python install), dealing with this string doesn't produce an error, and it is printed to the console as
P\u0131nar Kars\u0131yaka v Torku Selcuk
but in the production environment, a standard Ubuntu and Python install on an Amazon EC2 small box, it is printed as
P\xc4\xb1nar Kars\xc4\xb1yaka v Torku Selcuk
and gives one of the dreaded Python errors,
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 50: ordinal not in range(128)
I would like to know how (if possible) to make the prod environment handle these characters the way my dev environment does. I would also like to be able to make my dev environment break like the prod one, so that I can handle these cases within the code.
Thanks for any help with this.
Mac Python - Python 2.7.5 (default, Nov 1 2013, 18:38:34) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Ubuntu Python - Python 2.7.3 (default, Apr 10 2013, 06:20:15) [GCC 4.6.3] on linux2
If you dig a little into the 2.7 branch of the Python sources, you will find that the default encoding used for implicit unicode<->str conversion is first set to a hard-coded value (currently "ascii", though it was "utf-8" at one point), and that the site module can override it each time the interpreter starts.
To check the behaviour on each platform, run:
$ python -c 'import sys; print(sys.getdefaultencoding())'
Making the two environments match is not entirely straightforward, because the site module deletes the 'setdefaultencoding' function from sys at startup, so you have to reload the sys module to get it back:
$ python -c 'import sys; reload(sys); sys.setdefaultencoding("utf-8"); print(sys.getdefaultencoding())'
That way, you get the same default encoding in your interpreter on every platform, regardless of the locales and encodings defined at the various levels from the OS to the Python build.
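To see why the production box complains without touching the interpreter-wide default, the implicit conversion can be performed explicitly. This is only a sketch: `ascii_safe` is a hypothetical helper name, and the string is the one from the question written with escapes.

```python
# The name from the question, written with escapes so the source stays ASCII.
TEXT = u"P\u0131nar Kars\u0131yaka"


def ascii_safe(text):
    """Return True if `text` survives an explicit ASCII encode."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # Same failure as in the traceback: U+0131 is outside range(128).
        return False
    return True


print(ascii_safe(TEXT))            # the dotless i cannot be encoded as ASCII
print(repr(TEXT.encode("utf-8")))  # the byte form printed on the production box
```

Calling a check like this on suspicious values in dev is one way to surface the strings that would fail under an "ascii" default, whatever the local default happens to be.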
Library versions
Please verify that all the library versions are the same. I suspect an API change that returns unicode vs str from some external data source. I've seen these issues before when upgrading SQLObject and CherryPy. Data source settings matter too: for example, if you use a MySQL server, you need to pay attention to default_encoding.
Your question does not specify the data source, so it's hard to guess.
At the very least, run pip freeze in both environments and compare the version numbers.
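Comparing the two pip freeze outputs by eye gets tedious, so here is a rough sketch of a diff. It assumes each output has been saved as text; `parse_freeze` and `diff_freeze` are hypothetical helper names, and the version numbers in the sample are made up for illustration.

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze` style lines (name==version)."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and editable/VCS entries
        name, version = line.split("==", 1)
        versions[name.lower()] = version
    return versions


def diff_freeze(dev_text, prod_text):
    """Return {name: (dev_version, prod_version)} for packages that differ."""
    dev, prod = parse_freeze(dev_text), parse_freeze(prod_text)
    return {
        name: (dev.get(name), prod.get(name))
        for name in sorted(set(dev) | set(prod))
        if dev.get(name) != prod.get(name)
    }


# Made-up sample data, not taken from the question.
sample_dev = "SQLObject==1.5.1\nCherryPy==3.2.4\n"
sample_prod = "SQLObject==1.3.2\nCherryPy==3.2.4\n"
print(diff_freeze(sample_dev, sample_prod))
```

A package that is missing on one side shows up with None for that side, which is usually worth investigating as well.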
Default encoding
Check whether there is a sitecustomize.py in one of the environments. That's the official way to set up any wonky things (which you shouldn't do anyway, but that's another story).
It probably does exactly what @chocko01 suggests: sets the default encoding. Check by logging sys.getdefaultencoding() in both environments.
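A small report along these lines could be logged at startup in both environments and compared. It is a sketch only: `encoding_report` is a hypothetical name, and the set of values collected is my own guess at what is likely to differ.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding-related settings that can differ between machines."""
    return {
        "python": sys.version.split()[0],
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # stdout may have been replaced by an object without an encoding attribute
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(),
        # site imports sitecustomize at startup when one is on the path
        "sitecustomize_loaded": "sitecustomize" in sys.modules,
    }


for key, value in sorted(encoding_report().items()):
    print("%s: %s" % (key, value))
```

If sitecustomize_loaded is True on only one machine, that file is the first place to look.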
Setting the default encoding makes unicode<->str conversion transparent in Python 2 (Python 3 has no setdefaultencoding and never converts between str and bytes implicitly), but in the long run it's a bad idea. Remember that explicit is better than implicit.
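As an illustration of the explicit route, conversion can be confined to the points where data enters and leaves the program. The helper names below are hypothetical, and UTF-8 is an assumption based on the bytes shown in the question.

```python
def decode_input(raw, encoding="utf-8"):
    """Turn bytes from the outside world into text, once, at the boundary."""
    return raw.decode(encoding)


def encode_output(text, encoding="utf-8"):
    """Turn text back into bytes just before it leaves the program."""
    return text.encode(encoding)


raw = b"P\xc4\xb1nar Kars\xc4\xb1yaka"  # bytes as seen on the production box
text = decode_input(raw)               # text everywhere inside the program
print(repr(text))
print(repr(encode_output(text)))
```

With conversions named like this, nothing depends on sys.getdefaultencoding(), so both environments behave the same way.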
Trace your data
It's a tough nut to crack, but unless you can capture this particular problem in a reproducible test, the second-best option is to dump tons of logs, work your way backwards, and see where your funky input comes from.
Then trace it downwards to determine where your local and production environments diverge.
At the point of error, the value is unicode in your local env and UTF-8 encoded, aka str, in the production env. The fact that you have a sample from both environments suggests that you are able to reproduce the problem. Perhaps you should write an automated test as well.
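One possible shape for such a test uses the two samples from the question. `to_text` is a hypothetical normalising helper, not something from the asker's code, and the UTF-8 default is an assumption.

```python
TEXT_TYPE = type(u"")   # unicode on Python 2, str on Python 3
BYTES_TYPE = type(b"")  # str on Python 2, bytes on Python 3


def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it only if it arrived as bytes."""
    if isinstance(value, BYTES_TYPE):
        return value.decode(encoding)
    if isinstance(value, TEXT_TYPE):
        return value
    raise TypeError("expected text or bytes, got %r" % type(value))


def test_dev_and_prod_forms_match():
    dev_value = u"P\u0131nar Kars\u0131yaka v Torku Selcuk"        # local env
    prod_value = b"P\xc4\xb1nar Kars\xc4\xb1yaka v Torku Selcuk"  # production env
    assert to_text(dev_value) == to_text(prod_value)
    assert isinstance(to_text(prod_value), TEXT_TYPE)
```

A test like this pins down the expected type at the point of error, so a library upgrade that starts returning the other type gets caught in either environment.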