How do I handle Python unicode strings with null-bytes the 'right' way?

Question

It seems that PyWin32 is comfortable with giving null-terminated unicode strings as return values. I would like to deal with these strings the 'right' way.

Let's say I'm getting a string like: u'C:\\Users\\Guest\\MyFile.asy\x00\x00sy'. This appears to be a C-style null-terminated string hanging out in a Python unicode object. I want to trim this bad boy down to a regular ol' string of characters that I could, for example, display in a window title bar.

Is trimming the string off at the first null byte the right way to deal with it?

I didn't expect to get a return value like this, so I wonder if I'm missing something important about how Python, Win32, and unicode play together... or if this is just a PyWin32 bug.

Background

I'm using the Win32 file chooser function GetSaveFileNameW from the PyWin32 package. According to the documentation, this function returns a tuple containing the full filename path as a Python unicode object.

When I open the dialog with an existing path and filename set, I get a strange return value.

For example I had the default set to: C:\\Users\\Guest\\MyFileIsReallyReallyReallyAwesome.asy

In the dialog I changed the name to MyFile.asy and clicked save.

The full path part of the return value was: u'C:\\Users\\Guest\\MyFile.asy\x00wesome.asy'

I expected it to be: u'C:\\Users\\Guest\\MyFile.asy'

The function is returning a recycled buffer without trimming off the terminating bytes. Needless to say, the rest of my code wasn't set up for handling a C-style null-terminated string.
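To illustrate what "a recycled buffer without trimming" can look like, here is a small sketch using ctypes instead of PyWin32. It is only an analogy: it does not reproduce PyWin32's internals, the variable names are made up for the example, and the leftover characters will not match the exact output shown above. It should run on Python 2.7 and Python 3.

```python
import ctypes

# Hypothetical names, chosen only for this illustration.
old_name = u'MyFileIsReallyReallyReallyAwesome.asy'
new_name = u'MyFile.asy'

# Fixed-size wide-character buffer, pre-filled with the longer name.
buf = ctypes.create_unicode_buffer(old_name, len(old_name) + 1)

# Overwrite the start with the shorter name. A terminator is written
# right after it, but the tail of the old contents is left in place.
buf.value = new_name

whole_buffer = buf[:]    # every character in the buffer, NULs included
up_to_null = buf.value   # stops at the first NUL, as C code would

print(repr(whole_buffer))
print(repr(up_to_null))
```

Reading the whole buffer gives the new name, a null character, and then leftovers from the old name, which is the same shape as the value returned by the dialog. Reading only up to the first null gives the clean name.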

Demo Code

The following code demonstrates the null-terminated string in the return value of GetSaveFileNameW.

Directions: In the dialog change the filename to 'MyFile.asy' then click Save. Observe what is printed to the console. The output I get is u'C:\\Users\\Guest\\MyFile.asy\x00wesome.asy'.

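import win32gui, win32con

if __name__ == "__main__":
    initial_dir = 'C:\\Users\\Guest'
    # Long default name; the problem shows up when it is replaced by a much shorter one.
    initial_file = 'MyFileIsReallyReallyReallyAwesome.asy'
    # Filter entries are null-separated: display name, then pattern.
    filter_string = 'All Files\0*.*\0'
    (filename, customfilter, flags) = \
        win32gui.GetSaveFileNameW(InitialDir=initial_dir,
                    Flags=win32con.OFN_EXPLORER, File=initial_file,
                    DefExt='txt', Title="Save As", Filter=filter_string,
                    FilterIndex=0)
    # repr() makes any embedded \x00 visible in the console output.
    print repr(filename)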

Note: If you don't shorten the filename enough (for example, if you try MyFileIsReally.asy) the string will be complete without a null byte.

Environment

Windows 7 Professional 64-bit (no service pack), Python 2.7.1, PyWin32 Build 216

UPDATE: PyWin32 Tracker Artifact

Based on the comments and answers I have received so far, this is likely a PyWin32 bug, so I filed a tracker artifact.

UPDATE 2: Fixed!

Mark Hammond reported in the tracker artifact that this is indeed a bug. A fix was checked in to rev f3fdaae5e93d, so hopefully that will make the next release.

I think Aleksi Torhamo's answer below is the best solution for versions of PyWin32 before the fix.

asked Apr 05 '11 23:04 by Steven T. Snyder



1 Answer

I'd say it's a bug. The right way to deal with it would probably be fixing pywin32, but in case you aren't feeling adventurous enough, just trim it.

You can get everything before the first '\x00' with filename.split('\x00', 1)[0].
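As a minimal sketch, the same expression can be wrapped in a small helper. The name trim_at_null is hypothetical (it is not part of PyWin32), and the sample value is the one from the question. The code should run on Python 2.7 and Python 3.

```python
def trim_at_null(text):
    """Return the part of text before the first null character.

    A string without any null character is returned unchanged.
    """
    return text.split(u'\x00', 1)[0]


# Sample value taken from the question.
raw = u'C:\\Users\\Guest\\MyFile.asy\x00wesome.asy'
clean = trim_at_null(raw)
print(repr(clean))
```

In the demo code from the question, this would be applied to the returned value, for example filename = trim_at_null(filename), before the name is shown in a title bar or passed on to other code.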

answered Oct 11 '22 08:10 by Aleksi Torhamo