Dangers of sys.setdefaultencoding('utf-8')

There is a trend of discouraging setting sys.setdefaultencoding('utf-8') in Python 2. Can anybody list real examples of problems with that? Arguments like it is harmful or it hides bugs don't sound very convincing.

UPDATE: Please note that this question is only about utf-8, it is not about changing default encoding "in general case".

Please give some examples with code if you can.

asked Feb 22 '15 by anatoly techtonik


2 Answers

The original poster asked for code demonstrating that the switch is harmful, beyond it merely "hiding" bugs that are unrelated to the switch itself.

Updates

  • [2020-11-01]: pip install setdefaultencoding
    eradicates the need for reload(sys) (from Thomas Grainger); a sketch of the old hack it wraps appears after this list.

  • [2019]: Personal experience with python3:

    • No unicode en/decoding problems. Reasons:
    • I got used to writing .encode('utf-8') / .decode('utf-8') what felt like a hundred times a day.
    • Looking into libraries: same picture. 'utf-8' is either hardcoded or the silent default in pretty much all the I/O they do.
    • Heavily improved byte-string support finally made it possible to convert I/O-centric applications like Mercurial.
    • Having to write .encode and .decode all the time made people aware of the difference between strings for humans and strings for machines.
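
As referenced in the first update, this is the classic Python 2 hack that the setdefaultencoding package wraps; site.py deliberately deletes sys.setdefaultencoding at startup, so it has to be re-exposed via reload(sys) before it can be called:

    # Python 2 only: re-expose the function that site.py removed at startup
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')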

In my opinion, Python 2's byte strings, combined with decoding (UTF-8 by default) only just before output to humans or to Unicode-only formats, would have been the technically superior approach, compared to decoding at ingress and encoding at egress everywhere, many times without actual need. It depends on the application whether a function like len() is more practical returning the character count (for humans) or the number of bytes used to store and forward (for machines).
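
A minimal illustration of that len() difference (Python 2; the string is just an example):

    # -*- coding: utf-8 -*-
    s = 'héllo'              # byte string: 'é' occupies two bytes in UTF-8
    u = s.decode('utf-8')    # unicode string
    print len(s)             # 6 -- the machine-oriented count (bytes)
    print len(u)             # 5 -- the human-oriented count (characters)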

=> I think it's safe to say that UTF-8 everywhere saved the Unicode sandwich design. Without it, the many libraries and applications that only pass strings through without interpreting them could not work.
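
For reference, this is the "unicode sandwich" pattern being referred to, as a sketch (raw_bytes and transform are placeholders, not names from any library):

    text = raw_bytes.decode('utf-8')     # ingress: decode once, at the edge
    result = transform(text)             # inside: work on text only
    out_bytes = result.encode('utf-8')   # egress: encode once, at the edge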

Summary of conclusions

(from 2017)

Based on both experience and evidence I've collected, here are the conclusions I've arrived at.

  1. Setting the defaultencoding to UTF-8 nowadays is safe, except for specialised applications handling files from non-Unicode-ready systems.

  2. The "official" rejection of the switch is based on reasons no longer relevant for the vast majority of end users (as opposed to library providers), so we should stop discouraging users from setting it.

  3. Working in a model that handles Unicode properly by default is far better suited to applications for inter-system communication than working with the unicode APIs manually.

Effectively, modifying the default encoding avoids a number of user headaches in the vast majority of use cases. Yes, there are situations in which programs dealing with multiple encodings will silently misbehave, but since this switch can be enabled piecemeal, this is not a problem in end-user code.

More importantly, enabling this flag is a real advantage in users' code: it reduces the overhead of manual Unicode conversions that clutter the code and make it less readable, and it avoids potential bugs when the programmer fails to handle them properly in all cases.


Since these claims are pretty much the exact opposite of Python's official line of communication, I think an explanation for these conclusions is warranted.

Examples of successfully using a modified defaultencoding in the wild

  1. Dave Malcolm of Fedora believed it is always right. He proposed, after investigating the risks, to change def.enc.=UTF-8 distribution-wide for all Fedora users.

    The only hard fact presented for why Python would break is the hashing behavior I list in point 5 below, which was never picked up by any other opponent within the core community as a reason to worry, nor even by the same person when working on user tickets.

    Outcome at Fedora: admittedly, the change itself was described as "wildly unpopular" with the core developers, and it was accused of being inconsistent with previous versions.

  2. There are 3000 projects at openhub alone doing it. They have a slow search frontend, but scanning over it, I estimate 98% are using UTF-8. Nothing was found about nasty surprises.

  3. There are 18000(!) github master branches with it changed.

    While the change is "unpopular" with the core community, it's pretty popular in the user base. Though this could be disregarded, since users are known to use hacky solutions, I don't think this is a relevant argument, due to my next point.

  4. There are only 150 bug reports total on GitHub due to this. In effectively 100% of them, the change is referred to as positive, not negative.

    To summarize the existing issues people have run into, I've scanned through all of the aforementioned tickets.

    • Changing def.enc. to UTF-8 is typically introduced, but not removed, in the issue-closing process, most often as a solution. Some bigger projects excuse it as a temporary fix, considering the "bad press" it has, but far more bug reporters are just glad about the fix.

    • A few (1-5?) projects modified their code doing the type conversions manually so that they did not need to change the default anymore.

    • In two instances I saw someone claim that setting def.enc. to UTF-8 leads to a complete lack of output, without the test setup being explained. I could not verify the claim; I tested one case and found the opposite to be true.

    • One claims his "system" might depend on not changing it, but we do not learn why.

    • One (and only one) had a real reason to avoid it: ipython either uses a 3rd-party module, or the test runner modified their process in an uncontrolled way (it is never disputed that a def.enc. change is advocated by its proponents only at interpreter setup time, i.e. when "owning" the process).

  5. I found zero indication that the different hashes of 'é' and u'é' cause problems in real-world code.
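
    For reference, a minimal sketch of what that hash anomaly looks like (Python 2, UTF-8 source file):

        # -*- coding: utf-8 -*-
        import sys
        reload(sys)
        sys.setdefaultencoding('utf-8')

        # str hashes over bytes, unicode hashes over code points, so the
        # two objects compare equal but hash differently -- and dict
        # lookups go by hash first:
        print 'é' == u'é'                # True (implicit UTF-8 decode)
        print hash('é') == hash(u'é')    # False: equal objects, unequal hashes
        d = {'é': 1}
        print u'é' in d                  # False, despite the equality above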

  6. Python does not "break"

    After changing the setting to UTF-8, no feature of Python covered by unit tests works any differently than without the switch. The switch itself, though, is not tested at all.

  7. It is advised to frustrated users on bugs.python.org

    Examples here, here, or here (often accompanied by the official line of warning).

    The first one demonstrates how established the switch is in Asia (compare also with the github argument above).

  8. Ian Bicking published his support for always enabling this behavior.

    I can make my systems and communications consistently UTF-8, things will just get better. I really don't see a downside. But why does Python make it SO DAMN HARD [...] I feel like someone decided they were smarter than me, but I'm not sure I believe them.

  9. Martijn Faassen, while refuting Ian, admitted that ASCII might have been wrong in the first place.

    I believe if, say, Python 2.5, shipped with a default encoding of UTF-8, it wouldn't actually break anything. But if I did it for my Python, I'd have problems soon as I gave my code to someone else.

  10. In Python3, they don't "practice what they preach"

    While opposing any def.enc. change so harshly because of environment-dependent code or implicitness, a discussion here revolves around Python3's problems with its 'unicode sandwich' paradigm and the corresponding required implicit assumptions.

    Further, they created the possibility to write valid Python3 code like:

     >>> from 褐褑褒褓褔褕褖褗褘 import *
     >>> def 空手(合氣道): あいき(ど(合氣道))
     >>> 空手(う힑힜('👏 ') + 흾)
     💔
  11. DiveIntoPython recommends it.

  12. In this thread, Guido himself advises a professional end user to use a process-specific environment with the switch set, in order to "create a custom Python environment for each project."

    The fundamental reason the designers of Python's 2.x standard library don't want you to be able to set the default encoding in your app, is that the standard library is written with the assumption that the default encoding is fixed, and no guarantees about the correct workings of the standard library can be made when you change it. There are no tests for this situation. Nobody knows what will fail when. And you (or worse, your users) will come back to us with complaints if the standard library suddenly starts doing things you didn't expect.

  13. Jython offers to change it on the fly, even in modules.

  14. PyPy did not support reload(sys) - but brought it back on user request within a single day, no questions asked. Compare this with the "you are doing it wrong" attitude of CPython, claiming without proof that it is the "root of evil".


Ending this list, I confirm that one could construct a module which crashes because of a changed interpreter config, by doing something like this:

    def is_clean_ascii(s):
        """ [Stupid] type agnostic checker if only ASCII chars are contained in s """
        try:
            unicode(str(s))
            # we end up here also for NON-ASCII if the def.enc. was changed
            return True
        except Exception, ex:
            return False

    if is_clean_ascii(mystr):
        <code relying on mystr to be ASCII>

I don't think this is a valid argument, because the person who wrote this dual-type-accepting module was obviously aware of ASCII vs. non-ASCII strings and would therefore be aware of encoding and decoding.
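
For comparison, a minimal sketch of the same checker written so that it does not depend on the default encoding at all (Python 2; it assumes s is a str or unicode instance):

    def is_clean_ascii(s):
        """Check for pure ASCII using explicit codecs; def.enc. is irrelevant."""
        try:
            if isinstance(s, unicode):
                s.encode('ascii')   # non-ASCII unicode raises here
            else:
                s.decode('ascii')   # non-ASCII bytes raise here
            return True
        except (UnicodeDecodeError, UnicodeEncodeError):
            return False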

I think this evidence is more than enough indication that changing this setting does not lead to any problems in real-world codebases the vast majority of the time.

answered Oct 07 '22 by 10 revs, 3 users


Because you don't always want to have your strings automatically decoded to Unicode, or for that matter your Unicode objects automatically encoded to bytes. Since you are asking for a concrete example, here is one:

Take a WSGI web application; you are building a response by adding the product of an external process to a list, in a loop, and that external process gives you UTF-8 encoded bytes:

    results = []
    content_length = 0

    for somevar in some_iterable:
        output = some_process_that_produces_utf8(somevar)
        content_length += len(output)
        results.append(output)

    headers = [
        ('Content-Length', str(content_length)),
        ('Content-Type', 'text/html; charset=utf8'),
    ]
    start_response('200 OK', headers)
    return results

That's great and fine and works. But then your co-worker comes along and adds a new feature; you are now providing labels too, and these are localised:

    results = []
    content_length = 0

    for somevar in some_iterable:
        label = translations.get_label(somevar)
        output = some_process_that_produces_utf8(somevar)

        content_length += len(label) + len(output) + 1
        results.append(label + '\n')
        results.append(output)

    headers = [
        ('Content-Length', str(content_length)),
        ('Content-Type', 'text/html; charset=utf8'),
    ]
    start_response('200 OK', headers)
    return results

You tested this in English and everything still works, great!

However, the translations.get_label() library actually returns Unicode values, and when you switch locale, the labels contain non-ASCII characters.

The WSGI library writes those results out to the socket, and all the Unicode values get auto-encoded for you, since you set the default encoding to UTF-8 with setdefaultencoding(); but the length you calculated is entirely wrong. It will be too short, as UTF-8 encodes everything outside of the ASCII range with more than one byte.
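
A minimal sketch of that miscount, assuming a hypothetical French label (Python 2, default encoding set to UTF-8):

    # -*- coding: utf-8 -*-
    label = u'étiquette'              # hypothetical unicode value from get_label()
    body = label + '\n'               # stays unicode; the byte string is coerced
    print len(body)                   # 10 -- characters, what went into Content-Length
    print len(body.encode('utf-8'))   # 11 -- bytes actually written to the socket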

All this is ignoring the possibility that you are actually working with data in a different codec; you could be writing out Latin-1 + Unicode, and now you have an incorrect length header and a mix of data encodings.

Had you not used sys.setdefaultencoding(), an exception would have been raised and you would have known you had a bug. But now your clients are complaining about incomplete responses: there are bytes missing at the end of the page and you don't quite know how that happened.

Note that this scenario doesn't even involve 3rd party libraries that may or may not depend on the default still being ASCII. The sys.setdefaultencoding() setting is global, applying to all code running in the interpreter. How sure are you there are no issues in those libraries involving implicit encoding or decoding?

That Python 2 encodes and decodes between str and unicode types implicitly can be helpful and safe when you are dealing with ASCII data only. But you really need to know when you are accidentally mixing Unicode and byte-string data, rather than plastering over it with a global brush and hoping for the best.
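
For reference, the implicit coercion described above, in a stock Python 2 with the default ASCII codec:

    # -*- coding: utf-8 -*-
    print 'abc' + u'def'    # u'abcdef': the implicit ASCII decode succeeds
    print 'é' + u'x'        # UnicodeDecodeError: the byte string is not ASCII,
                            # so the bug surfaces here instead of being hidden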

answered Oct 07 '22 by Martijn Pieters