
Weird problem with input encoding in IPython

I'm running Python 2.6 with the latest IPython on Windows XP SP3, and I have two problems. The first is that under IPython I cannot input Unicode strings directly and, as a result, cannot open files with non-Latin names. Let me demonstrate. Under the regular Python interpreter this works:

>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
>>> fd = open(u'm:/Блокнот/home.tdl')
>>> print u'm:/Блокнот/home.tdl'
m:/Блокнот/home.tdl
>>>

The path contains Cyrillic, by the way. Under IPython I get:

In [49]: sys.getdefaultencoding()
Out[49]: 'ascii'

In [50]: sys.getfilesystemencoding()
Out[50]: 'mbcs'

In [52]: fd = open(u'm:/Блокнот/home.tdl')
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)

C:\Documents and Settings\andrey\<ipython console> in <module>()

IOError: [Errno 2] No such file or directory: u'm:/\x81\xab\xae\xaa\xad\xae\xe2/home.tdl'

In [53]: print u'm:/Блокнот/home.tdl'
-------------->print(u'm:/Блокнот/home.tdl')
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (15, 0))

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)

C:\Documents and Settings\andrey\<ipython console> in <module>()

C:\Program Files\Python26\lib\encodings\cp866.pyc in encode(self, input, errors)
     10
     11     def encode(self,input,errors='strict'):
---> 12         return codecs.charmap_encode(input,errors,encoding_map)
     13
     14     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode characters in position 3-9: character maps to <und

In [54]:

The second problem is less frustrating, but still annoying. When I pass the file name as a non-Unicode string, the file does not open. I have to explicitly decode the string from the OEM charset before I can open the file, which is pretty inconvenient:

>>> fd2 = open('m:/Блокнот/home.tdl'.decode('cp866'))
>>>

Maybe it has something to do with my regional settings; I don't know, because I can't even cut and paste Cyrillic text from the console. I've set "Russian" everywhere in the regional settings, but it does not seem to help.

Andrey Balaguta asked Feb 14 '10 10:02



2 Answers

Yes. Typing Unicode at the console is always problematic and generally best avoided, but IPython is particularly broken: it converts the characters you type at its console as if they were encoded in ISO-8859-1, regardless of the encoding the console actually uses.

For now, you'll have to say u'm:/\u0411\u043b\u043e\u043a\u043d\u043e\u0442/home.tdl'.
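
To make the diagnosis concrete, here is a minimal sketch that reproduces the mangled name from the IOError in the question. It assumes the console delivers cp866 bytes (the codec shown in the question's traceback) and that those bytes are then decoded as ISO-8859-1; it is an illustration of the effect, not IPython's actual code.

```python
# -*- coding: utf-8 -*-
# Illustration only: mimic the mis-decoding described above.
name = u'\u0411\u043b\u043e\u043a\u043d\u043e\u0442'  # the Cyrillic folder name

# Assumption: the Windows console hands over the typed text as cp866 bytes.
console_bytes = name.encode('cp866')

# Decoding those bytes as ISO-8859-1 maps each byte to the code point
# with the same number, which gives the string seen in the IOError.
mangled = console_bytes.decode('iso-8859-1')
assert mangled == u'\x81\xab\xae\xaa\xad\xae\xe2'

# Reversing both steps recovers the intended name.
recovered = mangled.encode('iso-8859-1').decode('cp866')
assert recovered == name
```

The escaped literal sidesteps the problem because it contains only ASCII characters, so there is nothing for the console layer to mis-decode.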

bobince answered Oct 10 '22 22:10


Perversely enough, this will work:

fd = open('m:/Блокнот/home.tdl')

Or:

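fd = open('m:/Блокнот/home.tdl'.decode('utf-8'))

This gets around IPython's bug by inputting the string as a raw UTF-8 encoded byte string, which IPython doesn't try any funny business with. You're then free to decode it into a unicode string if you like, and get on with your life. (Calling .encode('utf-8') on a non-ASCII byte string in Python 2 triggers an implicit ASCII decode first and raises UnicodeDecodeError, so .decode is the call you want here.)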
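
A minimal sketch of that decode step without hard-coding the charset. It assumes the byte string was typed at the console and that sys.stdin.encoding reports the console's encoding (cp866 on the asker's machine, UTF-8 on many others). The helper name to_unicode and the UTF-8 fallback are hypothetical choices, not part of IPython or the standard library.

```python
import sys


def to_unicode(raw, encoding=None):
    """Decode a console byte string into a unicode string.

    Hypothetical helper: `encoding` defaults to what the console
    reports, falling back to UTF-8 if nothing is reported.
    """
    if not isinstance(raw, bytes):
        # Already a unicode string; nothing to decode.
        return raw
    enc = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    return raw.decode(enc)


# Example with the cp866 bytes from the question's traceback.
folder = to_unicode(b'\x81\xab\xae\xaa\xad\xae\xe2', 'cp866')
assert folder == u'\u0411\u043b\u043e\u043a\u043d\u043e\u0442'
```

Passing the result to open() then behaves like passing a unicode literal in the regular interpreter.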

David Eyk answered Oct 10 '22 20:10