I would like to configure my console on Windows XP to support UTF8 and to have python detect that and work with it.
So far, my attempts:
C:\Documents and Settings\Philippe>C:\Python25\python.exe
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'é'
é
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> quit()
So, by default I am in cp437 and python detects that just fine.
C:\Documents and Settings\Philippe>chcp 65001
Active code page: 65001
C:\Documents and Settings\Philippe>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp65001'
>>> print u'é'
C:\Documents and Settings\Philippe>
It seems like printing in UTF8 makes python crash now...
I would like to configure my console on Windows XP to support UTF8
I don't think it's going to happen.
The 65001 code page is buggy; some stdio calls behave incorrectly and break many tools. Whilst you can register cp65001 as an encoding manually:
import codecs

def cp65001(name):
    # Codec search function: map the unknown 'cp65001' name onto UTF-8.
    if name.lower() == 'cp65001':
        return codecs.lookup('utf-8')

codecs.register(cp65001)
and this allows you to print u'some unicode string', it doesn't allow you to write non-ASCII characters in that Unicode string. You get the same odd errors (IOError 0 et al) that you do when you try to write non-ASCII UTF-8 sequences directly as byte strings.
Unfortunately UTF-8 is a second-class citizen under Windows. NT's Unicode model was drawn up before UTF-8 existed and consequently you're expected to use two-byte-per-code-unit encodings (UTF-16, originally UCS-2) anywhere you want consistent Unicode. Using byte strings, as many portable apps and languages (such as Python) written with C's stdio do, doesn't fit that model.
And rewriting Python to use the Windows Unicode console calls (like WriteConsoleW) instead of the portable C stdio ones doesn't play well with shell tricks like piping and redirecting to a file. (Not to mention that you still have to change from the default terminal font to a TTF one before you can see the results working at all...)
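To illustrate the trade-off described above, here is a minimal sketch of calling WriteConsoleW through ctypes. It assumes a modern Python 3 interpreter (not the 2.5 from the question), and the helper name console_write is made up for this example. The call only succeeds when stdout is a real console, so the sketch falls back to plain stdio when output is piped or redirected, which is exactly the awkward part.

```python
import sys

STD_OUTPUT_HANDLE = -11 & 0xFFFFFFFF  # (DWORD)-11, the standard output device


def console_write(text):
    """Write text with WriteConsoleW if stdout is a Windows console.

    Hypothetical helper for illustration. Returns True when the console
    API handled the write, False when it fell back to ordinary stdio
    (non-Windows platform, pipe, or file redirection).
    """
    if sys.platform == 'win32':
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
        kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
        kernel32.GetStdHandle.restype = wintypes.HANDLE
        kernel32.WriteConsoleW.argtypes = [
            wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
            ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID]
        kernel32.WriteConsoleW.restype = wintypes.BOOL

        sys.stdout.flush()  # keep ordering with earlier stdio output
        handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        # The length argument counts UTF-16 code units, not code points.
        n_units = len(text.encode('utf-16-le')) // 2
        written = wintypes.DWORD(0)
        if kernel32.WriteConsoleW(handle, text, n_units,
                                  ctypes.byref(written), None):
            return True

    # Fallback: stdout is not a console, so use the portable stdio path.
    sys.stdout.write(text)
    return False
```

Even when the call succeeds, the characters only display correctly if the console font has glyphs for them.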
Ultimately if you need a command line with working UTF-8 support for stdio-based apps, you'd probably be better off using an alternative to the Windows Console that deliberately supports it, such as Cygwin's, or Python's IDLE or pywin32's PythonWin.
When I try the same thing on Python 2.7 I get an error on import sys:
LookupError: unknown encoding: cp65001
This implies to me that Python doesn't know how to work with the special Windows UTF-8 code page, and 2.5 handled the situation ungracefully.
Apparently this was investigated and not fixed in Python 3.2: http://bugs.python.org/issue6058
Update: What's New In Python 3.3 lists cp65001 support as a new feature.
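A quick way to see whether a particular interpreter recognizes the code page is to ask the codec registry directly. This is only a sketch, and the helper name codec_available is made up for the example. The result for 'cp65001' depends on the Python version and platform, so no output is shown.

```python
import codecs


def codec_available(name):
    """Return True if this interpreter can look up the named codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        # Same exception type as the "unknown encoding" error quoted above.
        return False
    return True


print(codec_available('cp65001'))
```

If this prints False, the interpreter will hit the same LookupError when the console is switched to code page 65001.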