
Python, Windows: parsing command lines with shlex

When you have to split a command line, for example to call Popen, the best practice seems to be:

subprocess.Popen(shlex.split(cmd), ...

but the shlex documentation says:

The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell ...

So, what's the correct way on win32? And what about quote parsing and POSIX vs non-POSIX mode?

Massimo asked Nov 06 '15 05:11


1 Answer

As of March 2016, the Python stdlib has no valid command-line splitting function for Windows or for multi-platform use.

subprocess

In short, for subprocess.Popen, subprocess.call etc., it is best to do this:

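import shlex, subprocess, sys

if sys.platform == 'win32':
    args = cmd                 # Windows: pass the command string unsplit
else:
    args = shlex.split(cmd)    # POSIX: split into an argument list
subprocess.Popen(args, ...)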

On Windows the split is not necessary for either value of the shell option: internally Popen just uses subprocess.list2cmdline to re-join the split arguments anyway :-) .
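
To see that re-joining step in isolation, here is a minimal sketch. subprocess.list2cmdline is an undocumented helper, so treat its exact behaviour as an implementation detail; the argument values are made up for illustration.

```python
import subprocess

# Hypothetical argument list; any list of strings works.
args = ['prog.exe', 'a b', 'say "hi"']

# list2cmdline wraps arguments containing whitespace in double quotes
# and backslash-escapes embedded double quotes.
cmdline = subprocess.list2cmdline(args)
print(cmdline)   # prog.exe "a b" "say \"hi\""
```

The helper is plain Python, so it can be tried on any platform, not only on Windows.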

With shell=True, shlex.split is not necessary on Unix either.

Split or not, to start .bat or .cmd scripts on Windows (unlike .exe or .com files) you need to include the file extension explicitly, unless shell=True.

Notes on command-line splitting nonetheless:

shlex.split(cmd, posix=0) retains backslashes in Windows paths, but it doesn't handle quoting and escaping correctly. It's not very clear what the posix=0 mode of shlex is good for at all, but it almost certainly seduces Windows/cross-platform programmers ...
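
A small sketch of the difference between the two shlex modes, using a made-up Windows-style command line (the path and the argument are only illustrative):

```python
import shlex

cmd = r'C:\dir\prog.exe "a b"'   # made-up example command line

posix_args = shlex.split(cmd)                  # default POSIX mode
nonposix_args = shlex.split(cmd, posix=False)  # non-POSIX mode

print(posix_args)      # ['C:dirprog.exe', 'a b']
print(nonposix_args)   # ['C:\\dir\\prog.exe', '"a b"']
```

POSIX mode eats the backslashes as escape characters, while non-POSIX mode keeps them but also leaves the surrounding quote characters inside the argument, so neither result is what a Windows program would see in its argv.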

The Windows API exposes CommandLineToArgvW, reachable from Python as ctypes.windll.shell32.CommandLineToArgvW:

Parses a Unicode command line string and returns an array of pointers to the command line arguments, along with a count of such arguments, in a way that is similar to the standard C run-time argv and argc values.

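def win_CommandLineToArgvW(cmd):
    import ctypes
    nargs = ctypes.c_int()
    # The API returns an array of wide-string pointers (LPWSTR*).
    ctypes.windll.shell32.CommandLineToArgvW.restype = ctypes.POINTER(ctypes.c_wchar_p)
    # unicode() is Python 2; on Python 3 a str is already Unicode and can be passed as is.
    lpargs = ctypes.windll.shell32.CommandLineToArgvW(unicode(cmd), ctypes.byref(nargs))
    args = [lpargs[i] for i in range(nargs.value)]
    # LocalFree returns NULL on success, so a non-zero result means the free failed.
    if ctypes.windll.kernel32.LocalFree(lpargs):
        raise AssertionError
    return args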

However, CommandLineToArgvW is bogus, or at best only weakly similar to the mandatory standard C argv & argc parsing:

>>> win_CommandLineToArgvW('aaa"bbb""" ccc')
[u'aaa"bbb"""', u'ccc']
>>> win_CommandLineToArgvW('""  aaa"bbb""" ccc')
[u'', u'aaabbb" ccc']
>>> 
C:\scratch>python -c "import sys;print(sys.argv)" aaa"bbb""" ccc
['-c', 'aaabbb"', 'ccc']

C:\scratch>python -c "import sys;print(sys.argv)" ""  aaa"bbb""" ccc
['-c', '', 'aaabbb"', 'ccc']

Watch http://bugs.python.org/issue1724822 for possible future additions to the Python lib. (The function mentioned there, on the "fisheye3" server, doesn't really work correctly.)


Cross-platform candidate function

Valid Windows command-line splitting is rather crazy. E.g. try \ \\ \" \\"" \\\"aaa """" ...

My current candidate for cross-platform command-line splitting is the following function, which I am considering proposing for the Python lib. It's multi-platform; it's ~10x faster than shlex, which does single-char stepping and streaming; and it also respects pipe-related characters (unlike shlex). It already passes a list of tough real-shell tests on Windows and on Linux bash, plus the legacy POSIX test patterns of test_shlex. I'm interested in feedback about remaining bugs.

import re
import sys

def cmdline_split(s, platform='this'):
    """Multi-platform variant of shlex.split() for command-line splitting.
    For use with subprocess, for argv injection etc. Using fast REGEX.

    platform: 'this' = auto from current platform;
              1 = POSIX; 
              0 = Windows/CMD
              (other values reserved)
    """
    if platform == 'this':
        platform = (sys.platform != 'win32')
    if platform == 1:
        RE_CMD_LEX = r'''"((?:\\["\\]|[^"])*)"|'([^']*)'|(\\.)|(&&?|\|\|?|\d?\>|[<])|([^\s'"\\&|<>]+)|(\s+)|(.)'''
    elif platform == 0:
        RE_CMD_LEX = r'''"((?:""|\\["\\]|[^"])*)"?()|(\\\\(?=\\*")|\\")|(&&?|\|\|?|\d?>|[<])|([^\s"&|<>]+)|(\s+)|(.)'''
    else:
        raise AssertionError('unknown platform %r' % platform)

    args = []
    accu = None   # collects pieces of one arg
    for qs, qss, esc, pipe, word, white, fail in re.findall(RE_CMD_LEX, s):
        if word:
            pass   # most frequent
        elif esc:
            word = esc[1]
        elif white or pipe:
            if accu is not None:
                args.append(accu)
            if pipe:
                args.append(pipe)
            accu = None
            continue
        elif fail:
            raise ValueError("invalid or incomplete shell string")
        elif qs:
            word = qs.replace('\\"', '"').replace('\\\\', '\\')
            if platform == 0:
                word = word.replace('""', '"')
        else:
            word = qss   # may be even empty; must be last

        accu = (accu or '') + word

    if accu is not None:
        args.append(accu)

    return args
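
For anyone who wants to send feedback, one way to hunt for remaining bugs on the POSIX side is to compare a candidate splitter against shlex.split on a list of samples. The sketch below is only an illustration: compare_with_shlex and the sample strings are hypothetical names and data, not part of the function above, and str.split is used purely as a stand-in splitter to show what a mismatch report looks like.

```python
import shlex

def compare_with_shlex(splitter, samples):
    """Return (sample, expected, got) for every sample on which
    splitter disagrees with shlex.split in POSIX mode."""
    mismatches = []
    for s in samples:
        expected = shlex.split(s)
        got = splitter(s)
        if got != expected:
            mismatches.append((s, expected, got))
    return mismatches

# Made-up samples: plain words, double quotes, single quotes, escaped space.
samples = ['a b c', 'a "b c" d', "a 'b c' d", r'a\ b c']

# str.split is a deliberately naive stand-in; it fails on three samples.
for sample, expected, got in compare_with_shlex(str.split, samples):
    print('%r: expected %r, got %r' % (sample, expected, got))
```

To check the candidate itself, pass lambda s: cmdline_split(s, platform=1) as the splitter. Samples containing pipe or redirection characters should be expected to differ from shlex, since the function emits those as separate tokens.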

kxr answered Nov 15 '22 22:11