Is there a native "PHP way" to parse command arguments from a string? For example, given the following string:
foo "bar \"baz\"" '\'quux\''
I'd like to create the following array:
array(3) {
[0] =>
string(3) "foo"
[1] =>
string(9) "bar "baz""
[2] =>
string(6) "'quux'"
}
I've already tried to leverage token_get_all(), but PHP's variable interpolation syntax (e.g. "foo ${bar} baz") pretty much rained on my parade.
I know full well that I could write my own parser. Command argument syntax is super simplistic, but if there's an existing native way to do it, I'd much prefer that over rolling my own.
EDIT: Please note that I am looking to parse the arguments from a string, NOT from the shell/command-line.
EDIT #2: Below is a more comprehensive example of the expected input -> output for arguments:
foo -> foo
"foo" -> foo
'foo' -> foo
"foo'foo" -> foo'foo
'foo"foo' -> foo"foo
"foo\"foo" -> foo"foo
'foo\'foo' -> foo'foo
"foo\foo" -> foo\foo
"foo\\foo" -> foo\foo
"foo foo" -> foo foo
'foo foo' -> foo foo
There really is no native function for parsing commands to my knowledge. However, I have created a function which does the trick natively in PHP. By using str_replace several times, you are able to convert the string into something array convertible.
The getopt() function is probably the most correct answer in the case of the question, especially since it was made platform independent with PHP 5.3. In the particular case of this question and parsing multiple parameters, one way to leverage this function would be as follows:
Regexes are quite powerful: (?s)(?<!\\)("|')(?:[^\\]|\\.)*?\1|\S+. So what does this expression mean?
(?s): set the s modifier so that the dot . also matches newlines
(?<!\\): negative lookbehind, check that no backslash precedes the next token
("|'): match a single or double quote and put it in group 1
(?:[^\\]|\\.)*?: match everything that is not \, or match \ together with the immediately following (escaped) character
\1: match what was matched in the first group
|: or
\S+: match anything except whitespace one or more times
The idea is to capture a quote and group it to remember whether it's a single or a double one. The negative lookbehind is there to make sure we don't match escaped quotes. \1 is used to match the closing quote. Finally we use an alternation to match anything that's not whitespace. This solution is handy and is applicable to almost any language/flavor that supports lookbehinds and backreferences. Of course, this solution expects that the quotes are closed. The results are found in group 0.
Let's implement it in PHP:
$string = <<<INPUT
foo "bar \"baz\"" '\'quux\''
'foo"bar' "baz'boz"
hello "regex
world\""
"escaped escape\\\\"
INPUT;
preg_match_all('#(?<!\\\\)("|\')(?:[^\\\\]|\\\\.)*?\1|\S+#s', $string, $matches);
print_r($matches[0]);
If you wonder why I used 4 backslashes, take a look at my previous answer.
Output
Array
(
[0] => foo
[1] => "bar \"baz\""
[2] => '\'quux\''
[3] => 'foo"bar'
[4] => "baz'boz"
[5] => hello
[6] => "regex
world\""
[7] => "escaped escape\\"
)
Removing the quotes
Quite simple using named groups and a simple loop:
preg_match_all('#(?<!\\\\)("|\')(?<escaped>(?:[^\\\\]|\\\\.)*?)\1|(?<unescaped>\S+)#s', $string, $matches, PREG_SET_ORDER);
$results = array();
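foreach ($matches as $array) {
    // With PREG_SET_ORDER, 'unescaped' is only present when the bare-word branch matched.
    // Testing it (instead of !empty() on 'escaped') keeps empty quoted arguments such as "".
    if (isset($array['unescaped'])) {
        $results[] = $array['unescaped'];
    } else {
        $results[] = $array['escaped'];
    }
}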
print_r($results);
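Note that the loop above strips the enclosing quotes but leaves escape sequences such as \" in place. Here is a minimal sketch of resolving them, assuming the rules from the question's table (a backslash only escapes the enclosing quote character or another backslash). The name parseArguments is a hypothetical helper, not part of the original answer:

```php
<?php
// Hypothetical helper: tokenizes with the named-group regex from the answer,
// then resolves \" or \' (matching the enclosing quote) and \\ inside quoted arguments.
function parseArguments(string $input): array
{
    $pattern = '#(?<!\\\\)("|\')(?<escaped>(?:[^\\\\]|\\\\.)*?)\1|(?<unescaped>\S+)#s';
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $args = [];
    foreach ($matches as $m) {
        if (isset($m['unescaped'])) {
            // Bare word: taken as-is.
            $args[] = $m['unescaped'];
            continue;
        }
        // $m[1] holds the quote character that opened this argument.
        $escapable = preg_quote($m[1], '/') . '\\\\';
        // Drop the backslash only in front of the enclosing quote or another backslash.
        $args[] = preg_replace('/\\\\([' . $escapable . '])/', '$1', $m['escaped']);
    }
    return $args;
}

// Nowdoc, so the backslashes reach the parser unchanged.
$example = <<<'INPUT'
foo "bar \"baz\"" '\'quux\''
INPUT;
print_r(parseArguments($example));
```

A single preg_replace pass is used here rather than chained str_replace calls so that an escaped backslash followed by a quote is not unescaped twice.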
I've worked out the following expression to match the various enclosures and escapement:
$pattern = <<<REGEX
/
(?:
" ((?:(?<=\\\\)"|[^"])*) "
|
' ((?:(?<=\\\\)'|[^'])*) '
|
(\S+)
)
/x
REGEX;
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);
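It matches, in order: a double-quoted string in which a double quote may be escaped, the same for a single-quoted string, or an unquoted run of non-whitespace characters.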
Afterwards, you need to (carefully) remove the escaped characters:
$args = array();
foreach ($matches as $match) {
if (isset($match[3])) {
$args[] = $match[3];
} elseif (isset($match[2])) {
$args[] = str_replace(['\\\'', '\\\\'], ["'", '\\'], $match[2]);
} else {
$args[] = str_replace(['\\"', '\\\\'], ['"', '\\'], $match[1]);
}
}
print_r($args);
Update
For the fun of it, I've written a more formal parser, outlined below. It won't give you better performance; it's about three times slower than the regular expression, mostly due to its object-oriented nature. I suppose the advantage is more academic than practical:
class ArgvParser2 extends StringIterator
{
    const TOKEN_DOUBLE_QUOTE = '"';
    const TOKEN_SINGLE_QUOTE = "'";
    const TOKEN_SPACE = ' ';
    const TOKEN_ESCAPE = '\\';

    public function parse()
    {
        $this->rewind();
        $args = [];
        while ($this->valid()) {
            switch ($this->current()) {
                case self::TOKEN_DOUBLE_QUOTE:
                case self::TOKEN_SINGLE_QUOTE:
                    $args[] = $this->QUOTED($this->current());
                    break;
                case self::TOKEN_SPACE:
                    $this->next();
                    break;
                default:
                    $args[] = $this->UNQUOTED();
            }
        }
        return $args;
    }

    private function QUOTED($enclosure)
    {
        $this->next();
        $result = '';
        while ($this->valid()) {
            if ($this->current() == self::TOKEN_ESCAPE) {
                $this->next();
                if ($this->valid() && $this->current() == $enclosure) {
                    $result .= $enclosure;
                } elseif ($this->valid()) {
                    $result .= self::TOKEN_ESCAPE;
                    if ($this->current() != self::TOKEN_ESCAPE) {
                        $result .= $this->current();
                    }
                }
            } elseif ($this->current() == $enclosure) {
                $this->next();
                break;
            } else {
                $result .= $this->current();
            }
            $this->next();
        }
        return $result;
    }

    private function UNQUOTED()
    {
        $result = '';
        while ($this->valid()) {
            if ($this->current() == self::TOKEN_SPACE) {
                $this->next();
                break;
            } else {
                $result .= $this->current();
            }
            $this->next();
        }
        return $result;
    }

    public static function parseString($input)
    {
        $parser = new self($input);
        return $parser->parse();
    }
}
It's based on StringIterator to walk through the string one character at a time:
class StringIterator implements Iterator
{
    private $string;
    private $current;

    public function __construct($string)
    {
        $this->string = $string;
    }

    public function current()
    {
        return $this->string[$this->current];
    }

    public function next()
    {
        ++$this->current;
    }

    public function key()
    {
        return $this->current;
    }

    public function valid()
    {
        return $this->current < strlen($this->string);
    }

    public function rewind()
    {
        $this->current = 0;
    }
}
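For comparison, the same character-by-character idea can be sketched without the iterator classes. This is only an illustration under stated assumptions: tokenizeArguments is a hypothetical name, any of space, tab, CR or LF separates arguments, and (unlike the class above) a quoted segment directly adjacent to other text is joined into one argument, as a shell would do:

```php
<?php
// Hypothetical procedural variant of the character-walking parser.
function tokenizeArguments(string $input): array
{
    $args = [];
    $current = '';
    $inToken = false; // true once the current argument has started
    $quote = null;    // active quote character, or null outside quotes
    $length = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];

        if ($quote !== null) {
            if ($char === '\\' && $i + 1 < $length
                && ($input[$i + 1] === $quote || $input[$i + 1] === '\\')) {
                // An escaped enclosing quote or escaped backslash collapses to one character.
                $current .= $input[++$i];
            } elseif ($char === $quote) {
                $quote = null; // closing quote
            } else {
                $current .= $char; // includes a backslash that escapes nothing
            }
        } elseif ($char === '"' || $char === "'") {
            $quote = $char;
            $inToken = true; // so that "" yields an empty argument
        } elseif (strpos(" \t\r\n", $char) !== false) {
            if ($inToken) {
                $args[] = $current;
                $current = '';
                $inToken = false;
            }
        } else {
            $current .= $char;
            $inToken = true;
        }
    }

    if ($inToken) {
        $args[] = $current; // flush the last argument (also when a quote was left open)
    }
    return $args;
}

// Nowdoc, so the backslashes reach the parser unchanged.
$sample = <<<'INPUT'
foo "bar \"baz\"" '\'quux\''
INPUT;
print_r(tokenizeArguments($sample));
```

Whether an unclosed quote should be flushed silently, as done here, or reported as an error is a design choice the original answers leave open.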
Well, you could also build this parser with a recursive regex:
$regex = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
Now that's a bit long, so let's break it out:
$identifier = '[a-zA-Z0-9.-]+';
$doubleQuotedString = "\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"";
$singleQuotedString = "'([^'\\\\]+(?2)|\\\\.(?2)|)'";
$regex = "($identifier|$doubleQuotedString|$singleQuotedString)s";
So how does this work? Well, the identifier should be obvious...
The two quoted sub-patterns are basically the same, so let's look at the single quoted string:
'([^'\\\\]+(?2)|\\\\.(?2)|)'
Really, that's a quote character followed by a recursive sub-pattern, followed by an end quote.
The magic happens in the sub-pattern.
[^'\\\\]+(?2)
That part basically consumes any non-quote and non-escape character. We don't care about them, so eat them up. Then, if we encounter either a quote or a backslash, trigger an attempt to match the entire sub-pattern again.
\\\\.(?2)
If we can consume a backslash, then consume the next character (without caring what it is), and recurse again.
Finally, we have an empty component (if the escaped character is last, or if there's no escape character).
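Applying the pattern could look like the following sketch. It assumes preg_match_all with the parentheses acting as regex delimiters (as in the definitions above) and uses a shortened, hypothetical input; the matches keep their quotes and escape sequences:

```php
<?php
$identifier = '[a-zA-Z0-9.-]+';
$doubleQuotedString = "\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"";
$singleQuotedString = "'([^'\\\\]+(?2)|\\\\.(?2)|)'";
// The outer ( ) are the pattern delimiters, so (?1) and (?2) refer to the
// recursive groups inside the double- and single-quoted sub-patterns.
$regex = "($identifier|$doubleQuotedString|$singleQuotedString)s";

// Nowdoc, so the backslashes reach the regex engine unchanged.
$recursiveInput = <<<'INPUT'
foo "bar \"baz\"" '\'quux\''
INPUT;

preg_match_all($regex, $recursiveInput, $recursiveMatches);
// $recursiveMatches[0] holds the full tokens, still quoted and escaped.
var_dump($recursiveMatches[0]);
```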
Running this on the test input @HamZa provided returns the same result:
array(8) {
[0]=>
string(3) "foo"
[1]=>
string(13) ""bar \"baz\"""
[2]=>
string(10) "'\'quux\''"
[3]=>
string(9) "'foo"bar'"
[4]=>
string(9) ""baz'boz""
[5]=>
string(5) "hello"
[6]=>
string(16) ""regex
world\"""
[7]=>
string(18) ""escaped escape\\""
}
The main difference is in terms of efficiency. This pattern should backtrack less (since it's a recursive pattern, there should be next to no backtracking for a well-formed string), whereas the other regex is non-recursive and will backtrack on every single character (that's what the ? after the * forces: non-greedy pattern consumption).
For short inputs this doesn't matter. On the test case provided, they run within a few percent of each other (the margin of error is greater than the difference). But with a single long string with no escape sequences:
"with a really long escape sequence match that will force a large backtrack loop"
The difference is significant (100 runs):
float(0.00030398368835449)
float(0.00055909156799316)
Of course, we can partially lose this advantage with a lot of escape sequences:
"This is \" A long string \" With a\lot \of \"escape \sequences"
float(0.00040411949157715)
float(0.00045490264892578)
But note that the length still dominates. That's because the backtracker scales at O(n^2), where the recursive solution scales at O(n). However, since the recursive pattern always needs to recurse at least once, it's slower than the backtracking solution on short strings:
"1"
float(0.0002598762512207)
float(0.00017595291137695)
The tradeoff appears to happen around 15 characters... But both are fast enough that it won't make a difference unless you're parsing several KB or MB of data... But it's worth discussing...
On sane inputs, it won't make a significant difference. But if you're matching more than a few hundred bytes, it may start to add up significantly...
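Timings like the ones above can be reproduced with a small harness along these lines. This is a sketch only: timeRegex is a hypothetical helper, the run count of 100 mirrors the text, and absolute numbers will vary with the machine, PHP build, and PCRE version:

```php
<?php
// Hypothetical timing helper: total wall-clock seconds for $runs executions.
function timeRegex(string $regex, string $subject, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_match_all($regex, $subject, $matches);
    }
    return microtime(true) - $start;
}

// The recursive pattern and the lazy (non-greedy) pattern discussed in this thread.
$recursive = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
$lazy = '#(?<!\\\\)("|\')(?:[^\\\\]|\\\\.)*?\1|\S+#s';

$subject = '"with a really long escape sequence match that will force a large backtrack loop"';

var_dump(timeRegex($recursive, $subject));
var_dump(timeRegex($lazy, $subject));
```

For more stable numbers, increasing the run count and repeating the measurement a few times is advisable, since a single batch of 100 runs is easily dominated by noise.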
If you need to handle arbitrary "bare words" (unquoted strings), then you can change the original regex to:
$regex = "([^\s'\"]\S*|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
However, it really depends on your grammar and what you consider a command or not. I'd suggest formalizing the grammar you expect...
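To illustrate what the bare-word change does, here is a short sketch comparing both patterns on a hypothetical input containing a character (=) that the identifier class does not allow:

```php
<?php
// Pattern with the restricted identifier class.
$identifierRegex = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
// Pattern accepting arbitrary bare words that do not start with a quote.
$bareWordRegex = "([^\s'\"]\S*|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";

// Hypothetical input, chosen because "=" is outside [a-zA-Z0-9.-].
$line = '--flag=value "quoted arg"';

preg_match_all($identifierRegex, $line, $strict);
preg_match_all($bareWordRegex, $line, $loose);

// The identifier pattern silently skips "=" and splits the word in two;
// the bare-word pattern keeps --flag=value as a single token.
var_dump($strict[0]);
var_dump($loose[0]);
```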