I'm running Python 2.6 with the latest IPython on Windows XP SP3, and I have two questions. The first problem is that under IPython I cannot input Unicode strings directly and, as a result, cannot open files with non-Latin names. Let me demonstrate. Under the regular Python interpreter this works:
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
>>> fd = open(u'm:/Блокнот/home.tdl')
>>> print u'm:/Блокнот/home.tdl'
m:/Блокнот/home.tdl
>>>
The path contains Cyrillic characters, by the way. Under IPython I get:
In [49]: sys.getdefaultencoding()
Out[49]: 'ascii'
In [50]: sys.getfilesystemencoding()
Out[50]: 'mbcs'
In [52]: fd = open(u'm:/Блокнот/home.tdl')
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
C:\Documents and Settings\andrey\<ipython console> in <module>()
IOError: [Errno 2] No such file or directory: u'm:/\x81\xab\xae\xaa\xad\xae\xe2/home.tdl'
In [53]: print u'm:/Блокнот/home.tdl'
-------------->print(u'm:/Блокнот/home.tdl')
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (15, 0))
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
C:\Documents and Settings\andrey\<ipython console> in <module>()
C:\Program Files\Python26\lib\encodings\cp866.pyc in encode(self, input, errors)
10
11 def encode(self,input,errors='strict'):
---> 12 return codecs.charmap_encode(input,errors,encoding_map)
13
14 def decode(self,input,errors='strict'):
UnicodeEncodeError: 'charmap' codec can't encode characters in position 3-9: character maps to <und
In [54]:
The second problem is less frustrating, but still annoying. When I pass the file name as a non-Unicode string, the file does not open. I have to explicitly decode the string from the OEM charset before I can open the file, which is pretty inconvenient:
>>> fd2 = open('m:/Блокнот/home.tdl'.decode('cp866'))
>>>
Maybe it has something to do with my regional settings, I don't know, because I can't even cut and paste Cyrillic text from the console. I've set "Russian" everywhere in the regional settings, but it does not seem to help.
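An encoding such as cp866 or ASCII maps only a limited set of Unicode characters to bytes. Encoding a character that the target encoding does not map fails with UnicodeEncodeError. To avoid the error, convert explicitly: call .encode() on a Unicode string and .decode() on a byte string, naming an encoding that covers your characters (UTF-8 can represent every Unicode character).
In Python 3, the built-in string type represents Unicode text, which lets programs work with all of these characters. In Python 2, which the question uses, that role belongs to the separate unicode type.
Python 3 uses UTF-8 as its default source encoding, and there sys.getdefaultencoding() returns 'utf-8'. Python 2 defaults to ASCII, as the sessions above show.
In Python 2, normal strings (str) are sequences of 8-bit bytes with no encoding attached, while Unicode strings (unicode) are sequences of code points, stored internally as 16-bit or 32-bit units depending on how the interpreter was built. This allows for a far more varied set of characters, including those of most languages in the world.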
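To make the encode/decode distinction concrete, here is a minimal sketch. The name can_encode is a hypothetical helper introduced only for illustration, and the word is the folder name from the question, written with escapes so the snippet itself stays pure ASCII.

```python
# Runs unchanged on Python 2.6+ and Python 3.3+.

# The Cyrillic folder name from the question, written with \u escapes.
text = u'\u0411\u043b\u043e\u043a\u043d\u043e\u0442'

# Encoding turns code points into bytes; the result depends on the codec.
as_cp866 = text.encode('cp866')   # one byte per Cyrillic letter
as_utf8 = text.encode('utf-8')    # two bytes per Cyrillic letter

# Decoding with the matching codec restores the original text.
round_trip = as_cp866.decode('cp866')


def can_encode(value, encoding):
    """Hypothetical helper: True if every character of value exists in encoding."""
    try:
        value.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

With these definitions, can_encode(text, 'cp866') succeeds while can_encode(text, 'ascii') does not, which is the same kind of failure as the UnicodeEncodeError in the question.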
Yes. Typing Unicode at the console is always problematic and generally best avoided, but IPython is particularly broken. It converts characters you type on its console as if they were encoded in ISO-8859-1, regardless of the actual encoding you're giving it.
For now, you'll have to say u'm:/\u0411\u043b\u043e\u043a\u043d\u043e\u0442/home.tdl'.
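The ISO-8859-1 explanation can be checked against the traceback in the question. The sketch below takes the mangled path exactly as the IOError printed it and assumes, based on the cp866.pyc frame in that traceback, that the console code page is cp866. Variable names are illustrative only.

```python
# Runs unchanged on Python 2.6+ and Python 3.3+.

# Path exactly as it appears in the IOError message from the question.
mangled = u'm:/\x81\xab\xae\xaa\xad\xae\xe2/home.tdl'

# Undo the wrong ISO-8859-1 step to get the original bytes back,
# then decode them with the console's actual code page (assumed cp866).
repaired = mangled.encode('latin-1').decode('cp866')

# The path the user meant to type, written with escapes.
intended = u'm:/\u0411\u043b\u043e\u043a\u043d\u043e\u0442/home.tdl'

# The mangled characters do not exist in cp866, so encoding them for
# output fails, which matches the UnicodeEncodeError raised by print.
try:
    mangled.encode('cp866')
    print_would_fail = False
except UnicodeEncodeError as exc:
    print_would_fail = True
    first_bad_index = exc.start
```

Here repaired equals intended, which suggests the bytes typed at the console were valid cp866 and were only misread as ISO-8859-1.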
Perversely enough, this will work:
fd = open('m:/Блокнот/home.tdl')
Or:
fd = open('m:/Блокнот/home.tdl'.encode('utf-8'))
This gets around IPython's bug by inputting the string as a raw UTF-8 encoded byte string. IPython doesn't try any funny business with it. You're then free to decode it into a unicode string if you like, and get on with your life.
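If you go the byte-string route, a small wrapper can do the decoding step for you. This is only a sketch: to_text is a hypothetical helper, and the default of 'cp866' is an assumption taken from the question's console; use whatever encoding your terminal actually sends.

```python
# Runs unchanged on Python 2.6+ and Python 3.3+.

def to_text(value, encoding='cp866'):
    """Hypothetical helper: return value as Unicode text.

    Byte strings are decoded with the given encoding (assumed here to be
    the console code page); Unicode strings are returned unchanged.
    """
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2.6+
        return value.decode(encoding)
    return value
```

On Python 2 you could then call open(to_text('m:/.../home.tdl')) with the name typed as a plain byte string, where the path shown is a placeholder.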