
How can I make Ruby's Shellwords.shellescape work with multibyte characters?

I have been trying to call exec with an argument that contains multibyte characters that come from an environment variable on Windows, but have not found a solution that works yet. Here is what I have been able to debug so far.

For simplicity's sake assume that I have a directory called "Seán" that I am trying to use as an argument to exec. If I just call

exec 'script', "Se\u00E1n".encode("IBM437") 

The script that is exec'ed cannot find the file because the arg gets tweaked in such a way that the accented character is lost. If I do the following it works, but this is bad practice as the arg should be escaped before it goes to the shell.

exec "script #{"Se\u00E1n".encode("IBM437")}"

So my thought was that I would just use shellescape to protect the use of exec.

require 'shellwords'
exec "script #{"Se\u00E1n".encode("IBM437").shellescape}"

But the problem is that it escapes the accented character, so the result looks like "Se\án". I tracked this down to the following regular expression:

str.gsub!(/([^A-Za-z0-9_\-.,:\/@\n])/, "\\\\\\1")

At first glance, this escapes every character that is not in a known-good set of shell characters. Unfortunately, that set does not include accented characters, so I run into problems.
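
To make the behavior concrete, here is a minimal sketch that applies the quoted character class to a UTF-8 copy of the directory name. The helper name escape_ascii_only is hypothetical and only wraps the expression from the question; it is not part of Shellwords, and the real library may differ between Ruby versions.

```ruby
# Hypothetical helper wrapping the expression quoted in the question.
# Every character outside the ASCII-only "safe" class gets a backslash prepended.
def escape_ascii_only(str)
  str.gsub(/([^A-Za-z0-9_\-.,:\/@\n])/, "\\\\\\1")
end

name = "Se\u00E1n"                 # "Seán" as a UTF-8 string (assumption: UTF-8, not IBM437)
escaped = escape_ascii_only(name)

puts escaped                       # the accented letter now has a backslash in front of it
puts escape_ascii_only("plain_name.txt")  # pure ASCII names pass through unchanged
```

The accented letter is treated like any other "unsafe" character, which is why the name no longer matches the directory on disk.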

What I am looking for is a regex that would do shell escaping that does not mess up special characters so that I can escape these args before passing them to exec.

Ransom Briggs asked Nov 24 '15 22:11


2 Answers

The regex /([^A-Za-z0-9_\-.,:\/@\n])/ only treats ASCII letters and digits as safe, not all Unicode letters. The [^...] is a negated character class that matches every character other than those listed in the class. So characters such as Я, Ц, and Ą get a backslash prepended, because they are not covered by [A-Za-z].

What you need is to add shorthand classes so that all Unicode letters and digits are excluded from escaping. To make it safer, we can also add a class for combining marks, so that diacritics are kept, too:

str.gsub(/([^\p{L}\p{M}\p{N}_.,:\/@\n-])/, "\\\\\\1")

Here, \p{L} matches any Unicode letter, \p{M} matches combining marks (diacritics attached to a base letter), and \p{N} matches any Unicode numeric character.

Note that a hyphen does not need to be escaped when placed at the start/end of the character class (or after a valid range or a shorthand character class).
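
As a rough illustration of the expression above, the sketch below wraps it in a hypothetical helper named escape_unicode_aware and runs it on a few UTF-8 strings. It assumes the input is UTF-8; strings in other encodings, such as IBM437 from the question, are not covered by this sketch.

```ruby
# Hypothetical helper around the Unicode-aware expression from this answer.
# Letters, combining marks, digits and a few punctuation characters are left alone;
# everything else gets a backslash prepended.
def escape_unicode_aware(str)
  str.gsub(/([^\p{L}\p{M}\p{N}_.,:\/@\n-])/, "\\\\\\1")
end

precomposed = escape_unicode_aware("Se\u00E1n")       # á as a single code point
decomposed  = escape_unicode_aware("Sea\u0301n")      # a followed by a combining acute accent
with_meta   = escape_unicode_aware("Se\u00E1n & co")  # spaces and & are still escaped

puts precomposed
puts decomposed
puts with_meta
```

The decomposed form is the reason for including \p{M}: without it, the combining accent would be escaped even though the base letter is not.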

Wiktor Stribiżew answered Nov 03 '22 07:11


TL;DR

Escaped characters

Metacharacters


Code

String.class_eval do
    # Note: gsub! modifies the receiver in place; the method then returns self
    def escapeshell()
        # Escape shell special characters
        self.gsub!(/[#-&(-*;<>?\[-^`{-~\u00FF]/, '\\\\\0')
        # Escape unbalanced quotes (single and double quotes)
        self.gsub!(/(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/) do
            if $2.nil? 
                '\\' + $1
            else
                # and escape quotes inside (e.g. "x'x" or 'y"y')
                qt = $1
                qt + $2.gsub(/["']/, '\\\\\0') + qt
            end
        end
        self
    end
end


# Test it
str = "(dir *.txt & dir \"\\some dir\\Sè\u00E1ñ*.rb\") | sort /R >Filé.txt 2>&1"
puts 'String:'
puts str

puts "\nEscaped:"
puts str.escapeshell

Output

String:
(dir *.txt & dir "\some dir\Sèáñ*.rb") | sort /R >Filé.txt 2>&1

Escaped:
\(dir \*.txt \& dir "\\some dir\\Sèáñ\*.rb"\) \| sort /R \>Filé.txt 2\>\&1



Description

Metacharacters

Considering the shell metacharacters that should be escaped:

# & % ; ` | * ? ~ < > ^ ( ) [ ] { } $ \ \u00FF

We can include each character in the character class:

[#&%;`|*?~<>^()\[\]{}$\\\u00FF]

Which is exactly the same as:

/[#-&(-*;<>?\[-^`{-~\u00FF]/
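
One way to check that the two character classes are equivalent is to run both against every code point from U+0020 to U+00FF and compare the results. This is only a sanity-check sketch; the variable names are made up for the example.

```ruby
# The explicit class and the range-based class from the text above.
explicit = /[#&%;`|*?~<>^()\[\]{}$\\\u00FF]/
ranges   = /[#-&(-*;<>?\[-^`{-~\u00FF]/

# Every code point from space (U+0020) through U+00FF as a one-character UTF-8 string.
chars = (0x20..0xFF).map { |cp| cp.chr(Encoding::UTF_8) }

# Characters on which the two expressions disagree (expected: none).
mismatches = chars.reject { |c| explicit.match?(c) == ranges.match?(c) }

# Characters the range-based class matches, in code point order.
matched = chars.select { |c| ranges.match?(c) }.join

puts "mismatches: #{mismatches.inspect}"
puts "matched:    #{matched}"
```

The ranges work because the metacharacters happen to sit next to each other in ASCII, for example # $ % & and [ \ ] ^.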

Then, we use gsub!() to prepend a backslash to any character that is in the class:

str.gsub!(/[#-&(-*;<>?\[-^`{-~\u00FF]/, '\\\\\0')

Quotes

Only unbalanced quotes need to be escaped. This is important to preserve the command's arguments. With the following expression we match balanced quotes:

/(["'])([^"']*(?:(?!\1)["'][^"']*)*)\1/

We also match unbalanced quotes by making the last part optional:

/(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/

But we also need to escape quotes nested inside another pair, that is, single quotes inside double quotes and vice versa. So we nest another gsub() to replace the quotes in the text matched between the outer pair ($2):

str.gsub!(/(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/) do
    if $2.nil? 
        '\\' + $1
    else
        qt = $1
        qt + $2.gsub(/["']/, '\\\\\0') + qt
    end
end
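
The three cases described above (balanced, unbalanced, and nested quotes) can be sketched with a non-mutating variant of the same expression. The helper name escape_quotes is hypothetical, and the sketch only shows what the expression does to each input; it does not verify how any particular shell interprets the result.

```ruby
# Hypothetical, non-mutating variant of the quote handling shown above.
def escape_quotes(str)
  str.gsub(/(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/) do
    quote, inner = $1, $2          # copy the captures before the nested gsub overwrites them
    if inner.nil?
      "\\" + quote                 # unbalanced quote: escape it
    else
      # balanced pair: keep the outer quotes, escape any quotes nested inside
      quote + inner.gsub(/["']/) { |q| "\\" + q } + quote
    end
  end
end

balanced   = escape_quotes('say "hi"')   # the pair is kept as is
unbalanced = escape_quotes("it's")       # the lone quote is escaped
nested     = escape_quotes("\"it's\"")   # the inner quote is escaped, the outer pair is kept

puts balanced
puts unbalanced
puts nested
```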

Mariano answered Nov 03 '22 08:11