
Python Unicode Errors, sync up the dev environment and production

I've just run into a few Unicode errors with an application I'm running that every now and again has to deal with really odd strings, most recently:

Pınar Karsıyaka

In my dev environment (Aptana with PyDev on a Mavericks Mac with an up-to-date Homebrew Python install), dealing with this string doesn't produce an error, and it is printed to the console as

P\u0131nar Kars\u0131yaka v Torku Selcuk

but in the production environment, a standard Ubuntu and Python install on an Amazon EC2 small box, it is printed as

P\xc4\xb1nar Kars\xc4\xb1yaka v Torku Selcuk

and gives one of the dreaded Python errors,

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 50: ordinal not in range(128)

I would like to know how (if possible) to enable the prod environment to deal with these characters the way my dev environment can. I would also like to be able to make my dev environment break like the prod one, so that I can handle these cases within the code.

Thanks for any help with this.

Mac Python - Python 2.7.5 (default, Nov 1 2013, 18:38:34) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin

Ubuntu Python - Python 2.7.3 (default, Apr 10 2013, 06:20:15) [GCC 4.6.3] on linux2

seaders asked Oct 01 '22 05:10


2 Answers

If you dig a little into the 2.7 branch of the Python sources, you will find that the default encoding of unicode strings is first set to a hard-coded value (currently "ascii", though it was "utf-8" at one point), and that it can be overridden by the site module each time the interpreter starts.

To check the behaviour on each platform, run:

$ python -c 'import sys; print(sys.getdefaultencoding())'

Now, if you want to make them match, it is not entirely simple: the function setdefaultencoding is deleted from sys by the site module, so you have to reload the sys module to get it back:

$ python -c 'import sys; reload(sys); sys.setdefaultencoding("utf-8"); print(sys.getdefaultencoding())'

That way you get the same default encoding in your interpreter on each platform, regardless of the locales and encodings defined at the various levels, from the OS to the Python build.

piroux answered Oct 05 '22 07:10


Library versions

Please verify that all the library versions are the same. I suspect there's an API change that makes some external data source return unicode instead of str (or the other way round). I've seen these issues before when upgrading SQLObject and CherryPy. Data source settings are also important; for example, if you use a MySQL server, you need to pay attention to default_encoding.

Your question does not specify the data source, so it's hard to guess.

At the very least, run pip freeze in both environments and compare the version numbers.
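Comparing two pip freeze dumps by eye gets tedious, so here is a minimal sketch of doing it in Python. The helper names parse_freeze and diff_freeze are hypothetical, and the version numbers in the usage note are made up purely for illustration; it only handles plain name==version lines.

```python
def parse_freeze(text):
    """Turn `pip freeze` output into a {package: version} dict."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments and editable/VCS entries
        name, version = line.split("==", 1)
        versions[name.lower()] = version
    return versions


def diff_freeze(dev_text, prod_text):
    """Return {package: (dev_version, prod_version)} for every mismatch."""
    dev, prod = parse_freeze(dev_text), parse_freeze(prod_text)
    return {
        name: (dev.get(name), prod.get(name))
        for name in sorted(set(dev) | set(prod))
        if dev.get(name) != prod.get(name)
    }
```

For example, with invented inputs, diff_freeze("SQLObject==1.5.0", "SQLObject==1.3.2") reports {"sqlobject": ("1.5.0", "1.3.2")}. A package missing on one side shows up with None in that position.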

Default encoding

Check whether there is a sitecustomize.py in one of the environments. That's the official way to set up any wonky things (which you shouldn't do anyway, but that's another story).

It probably does exactly what @chocko01 suggests: sets the default encoding. Check by logging sys.getdefaultencoding() in both environments.
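A small sketch of what that logging could look like. Besides sys.getdefaultencoding(), it also records the stdout, filesystem and locale encodings; those extra fields are my own assumption about what might differ between the two machines, not something the question confirms. The function name encoding_report is hypothetical.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding settings that commonly differ between machines."""
    return {
        "default": sys.getdefaultencoding(),
        # stdout may be replaced by an object without an encoding attribute
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%-10s %s" % (name, value))
```

Run it on both boxes and compare the output line by line; any row that differs is a candidate explanation.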

Setting the default encoding makes the implicit unicode<->str conversion in Python 2 transparent (Python 3 has no such implicit conversion between str and bytes), but in the long run it's a bad idea. Remember that explicit is better than implicit.
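To make "explicit" concrete, here is a minimal sketch using the string from the question. It encodes and decodes by hand at the boundaries instead of relying on the default encoding, and it reproduces the ASCII failure on purpose. The variable names are only illustrative.

```python
# Text form: unicode in Python 2, str in Python 3.
text = u"P\u0131nar Kars\u0131yaka"

# Encode explicitly at the output boundary; this is the byte form
# that shows up in the production log.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"P\xc4\xb1nar Kars\xc4\xb1yaka"

# Decode explicitly at the input boundary to get the text back.
assert utf8_bytes.decode("utf-8") == text

# The ASCII codec cannot represent the dotless i, which is the error
# from the question.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
```

The same explicit .encode("ascii") call fails identically on any platform, so it is one way to get the production error in a dev environment without touching the interpreter's settings.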

Trace your data

It's a tough nut to crack, but unless you can capture this particular problem in a reproducible test, the second-best option is to dump tons of logs, then work your way backwards and see where your funky input comes from.

Then trace it downwards to determine where your local and production environments differ.

At the point of the error, the value is unicode in your local environment and a UTF-8-encoded str in production. The fact that you have a sample from both environments suggests that you are able to reproduce the problem. Perhaps you should write an automated test as well.
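Such a test could look roughly like the sketch below. The helper to_text is hypothetical (it is not part of your code or of any library) and it assumes the production bytes are UTF-8, which matches the sample in the question.

```python
import unittest


def to_text(value, encoding="utf-8"):
    """Return text whether `value` arrives as text or as encoded bytes."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value


class ToTextTest(unittest.TestCase):
    def test_dev_and_prod_forms_agree(self):
        dev_form = u"P\u0131nar Kars\u0131yaka"          # as seen locally
        prod_form = b"P\xc4\xb1nar Kars\xc4\xb1yaka"     # as seen in production
        self.assertEqual(to_text(dev_form), dev_form)
        self.assertEqual(to_text(prod_form), dev_form)
```

Run it with python -m unittest on both machines. Once the input is normalized like this at the edge, the rest of the code only ever sees one type.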

Dima Tisnek answered Oct 05 '22 08:10