Unicode filenames on Windows with Python & subprocess.Popen()

Why does the following occur:

>>> u'\u0308'.encode('mbcs')   #UMLAUT
'\xa8'
>>> u'\u041A'.encode('mbcs')   #CYRILLIC CAPITAL LETTER KA
'?'
>>>

I have a Python application accepting filenames from the operating system. It works for some international users, but not others.

For example, this unicode filename: u'\u041a\u0433\u044b\u044b\u0448\u0444\u0442'

will not encode with Windows 'mbcs' encoding (the one used by the filesystem, returned by sys.getfilesystemencoding()). I get '???????', indicating the encoder fails on those characters. But this makes no sense, since the filename came from the user to begin with.

Update: Here is the background. I have a file on my system with a Cyrillic name, and I want to call subprocess.Popen() with that file as an argument. Popen won't handle unicode. Normally I can get away with encoding the argument with the codec given by sys.getfilesystemencoding(), but in this case that doesn't work.

Norman asked Dec 15 '09 20:12


2 Answers

In Python 3, at least from Python 3.2, subprocess.Popen and sys.argv work consistently with (unicode by default) strings on Windows. Evidently CreateProcessW and GetCommandLineW are used.
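As a minimal sketch of the Python 3 behaviour described above, the snippet below passes the Cyrillic filename from the question to a child Python process and checks that it arrives unchanged. The child command and the variable names are illustrative assumptions, not part of the original answer.

```python
import subprocess
import sys

# The Cyrillic filename from the question, as a normal Python 3 str.
name = "\u041a\u0433\u044b\u044b\u0448\u0444\u0442"

# The child prints an ASCII-only repr of its first argument, so the check
# does not depend on the console's output encoding.
child_code = "import sys; print(ascii(sys.argv[1]))"

result = subprocess.run(
    [sys.executable, "-c", child_code, name],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)

received = result.stdout.strip()
# True when the argument survived the round trip intact.
print(received == ascii(name))
```

No manual encoding step is involved: the argument stays a str all the way to the child.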

In Python 2, up to v2.7.2 at least, subprocess.Popen is buggy with Unicode arguments. It sticks to CreateProcessA (while the os.* functions are consistent with Unicode), and shlex.split makes matters worse.
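To illustrate why an ANSI-only call loses the filename from the question, here is a small sketch that can be run on any platform. The 'mbcs' codec exists only on Windows and follows the machine's ANSI code page, so the explicit codecs cp1252 (Western European) and cp1251 (Cyrillic) are used here as stand-ins; which code page a given user actually has is an assumption.

```python
# The Cyrillic filename from the question.
name = u"\u041a\u0433\u044b\u044b\u0448\u0444\u0442"

# On a Western European code page none of these characters exist, so a
# lossy encode replaces every one of them with '?'.
western = name.encode("cp1252", "replace")

# On a Cyrillic code page each character has a single-byte representation,
# so the name survives a round trip.
cyrillic = name.encode("cp1251")

print(western)
print(cyrillic.decode("cp1251") == name)
```

This would explain why the application works for some international users and not others: the result depends on whether the user's ANSI code page happens to contain the characters in the filename.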

Pywin32's win32process.CreateProcess doesn't auto-switch to the W version either, and there is no win32process.CreateProcessW. The same goes for GetCommandLine. So ctypes.windll.kernel32.CreateProcessW... has to be used. The subprocess module should perhaps be fixed in this respect.
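A rough sketch of what calling CreateProcessW through ctypes could look like is shown below. The helper name create_process_w and the type aliases are hypothetical; the sketch starts a process from a single unicode command line, does not redirect any handles, and returns the new process ID. It is not the patched Popen linked in this thread and only runs on Windows.

```python
import ctypes

# Win32 type aliases, spelled out with fixed-width ctypes types.
DWORD = ctypes.c_uint32
WORD = ctypes.c_uint16
BOOL = ctypes.c_int
HANDLE = ctypes.c_void_p
LPWSTR = ctypes.c_wchar_p


class STARTUPINFOW(ctypes.Structure):
    _fields_ = [
        ("cb", DWORD),
        ("lpReserved", LPWSTR),
        ("lpDesktop", LPWSTR),
        ("lpTitle", LPWSTR),
        ("dwX", DWORD),
        ("dwY", DWORD),
        ("dwXSize", DWORD),
        ("dwYSize", DWORD),
        ("dwXCountChars", DWORD),
        ("dwYCountChars", DWORD),
        ("dwFillAttribute", DWORD),
        ("dwFlags", DWORD),
        ("wShowWindow", WORD),
        ("cbReserved2", WORD),
        ("lpReserved2", ctypes.c_void_p),
        ("hStdInput", HANDLE),
        ("hStdOutput", HANDLE),
        ("hStdError", HANDLE),
    ]


class PROCESS_INFORMATION(ctypes.Structure):
    _fields_ = [
        ("hProcess", HANDLE),
        ("hThread", HANDLE),
        ("dwProcessId", DWORD),
        ("dwThreadId", DWORD),
    ]


def create_process_w(command_line):
    """Hypothetical helper: start a process from a unicode command line.

    Windows only. Returns the process ID of the new process.
    """
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateProcessW.argtypes = [
        LPWSTR, LPWSTR, ctypes.c_void_p, ctypes.c_void_p, BOOL, DWORD,
        ctypes.c_void_p, LPWSTR,
        ctypes.POINTER(STARTUPINFOW), ctypes.POINTER(PROCESS_INFORMATION),
    ]
    kernel32.CreateProcessW.restype = BOOL
    kernel32.CloseHandle.argtypes = [HANDLE]
    kernel32.CloseHandle.restype = BOOL

    startup = STARTUPINFOW()
    startup.cb = ctypes.sizeof(STARTUPINFOW)
    info = PROCESS_INFORMATION()

    # CreateProcessW may modify the command line in place, so it needs a
    # writable buffer rather than an immutable string.
    buf = ctypes.create_unicode_buffer(command_line)

    ok = kernel32.CreateProcessW(
        None, buf, None, None, False, 0, None, None,
        ctypes.byref(startup), ctypes.byref(info),
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())

    # The handles are not needed by this sketch, so release them right away.
    kernel32.CloseHandle(info.hThread)
    kernel32.CloseHandle(info.hProcess)
    return info.dwProcessId
```

The command line has to be quoted by the caller. subprocess.list2cmdline can build one from a list of arguments, although using it this way is a suggestion rather than something the answer prescribes.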

Passing UTF-8 in argv[1:] to private apps remains clumsy on a Unicode OS. Such tricks may be legitimate on OSes with 8-bit ("Latin1"-style) strings, like Linux.

UPDATE: vaab has created a patched version of Popen for Python 2.7 which fixes the issue.
See https://gist.github.com/vaab/2ad7051fc193167f15f85ef573e54eb9
Blog post with explanations: http://vaab.blog.kal.fr/2017/03/16/fixing-windows-python-2-7-unicode-issue-with-subprocesss-popen/

kxr answered Oct 22 '22 04:10


DISCLAIMER: I'm the author of the fix mentioned below.

To support a unicode command line on Windows with Python 2.7, you can use this patch to subprocess.Popen(..)

The situation

Python 2 support for a unicode command line on Windows is very poor.

Two things are severely bugged:

  • issuing the unicode command line to the system from the caller side (via subprocess.Popen(..)),

  • and reading the current command line's unicode arguments from the callee side (via sys.argv).

This is acknowledged and won't be fixed in Python 2. Both are fixed in Python 3.

Technical Reasons

In Python 2, the Windows implementation of subprocess.Popen(..) uses the non-unicode-ready Windows system call CreateProcess(..) (see python code, and MSDN doc of CreateProcess), and sys.argv does not use GetCommandLineW(..).

In Python 3, the Windows implementation of subprocess.Popen(..) makes use of the correct Windows system call CreateProcessW(..) starting from 3.0 (see code in 3.0), and sys.argv uses GetCommandLineW(..) starting from 3.3 (see code in 3.3).
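For the callee side, a sketch along the following lines could read the arguments as unicode through GetCommandLineW(..) and CommandLineToArgvW(..). The helper name unicode_argv is hypothetical, and this is not the fix from the blog post. It assumes the interpreter's own arguments (such as the path to python.exe) come first on the command line, which holds for a plain script invocation but not necessarily for every way of launching Python.

```python
import ctypes
import sys


def unicode_argv():
    """Hypothetical helper: return the script arguments as unicode strings."""
    if sys.platform != "win32":
        # Nothing to repair on other platforms in this sketch.
        return list(sys.argv)

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    shell32 = ctypes.WinDLL("shell32", use_last_error=True)

    kernel32.GetCommandLineW.argtypes = []
    kernel32.GetCommandLineW.restype = ctypes.c_wchar_p
    shell32.CommandLineToArgvW.argtypes = [
        ctypes.c_wchar_p, ctypes.POINTER(ctypes.c_int),
    ]
    shell32.CommandLineToArgvW.restype = ctypes.POINTER(ctypes.c_wchar_p)
    kernel32.LocalFree.argtypes = [ctypes.c_void_p]
    kernel32.LocalFree.restype = ctypes.c_void_p

    argc = ctypes.c_int(0)
    argv = shell32.CommandLineToArgvW(
        kernel32.GetCommandLineW(), ctypes.byref(argc)
    )
    if not argv:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        # Copy the strings out before the array is released.
        args = [argv[i] for i in range(argc.value)]
    finally:
        # CommandLineToArgvW allocates the array; the caller must free it.
        kernel32.LocalFree(ctypes.cast(argv, ctypes.c_void_p))

    # Drop the leading interpreter arguments so that the result lines up
    # with sys.argv (script name first).
    skip = max(0, len(args) - len(sys.argv))
    return args[skip:]


print(unicode_argv())
```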

How is it fixed

The patch uses the ctypes module to call the Windows system function CreateProcessW(..) directly. It provides a fixed Popen object by overriding the private method Popen._execute_child(..) and the private function _subprocess.CreateProcess(..) to set up and use CreateProcessW(..) from the Windows system library, in a way that mimics as closely as possible how it is done in Python 3.6.

How to use it

Using the patch is demonstrated in this blog post. The post also shows how to read the current process's sys.argv with another fix.

vaab answered Oct 22 '22 06:10