
Regular expression to partially extract PHP code (array definition)

Tags: arrays, regex, php

I have PHP code (an array definition) stored in a string like this:

$code=' array(

  0  => "a",
 "a" => $GlobalScopeVar,
 "b" => array("nested"=>array(1,2,3)),  
 "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },

); ';

Is there a regular expression to extract this array? I mean, I want something like:

$array=array(

  0  => '"a"',
 'a' => '$GlobalScopeVar',
 'b' => 'array("nested"=>array(1,2,3))',
 'c' => 'function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }',

);

PS: I did some research trying to find a regular expression, but found nothing.
PS2: gods of stackoverflow, let me bounty this now and I will offer 400 :3
PS3: this will be used in an internal app, where I need to extract an array from some PHP file so it can be 'processed' in parts. I try to explain it with this: codepad.org/td6LVVme

AgelessEssence asked Jun 14 '13 22:06

2 Answers

Regex

So here's the MEGA regex I came up with:

\s*                                     # white spaces
########################## KEYS START ##########################
(?:                                     # We\'ll use this to make keys optional
(?P<keys>                               # named group: keys
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
)                                       # close group: keys
########################## KEYS END ##########################
\s*                                     # white spaces
=>                                      # match =>
)?                                      # make keys optional
\s*                                     # white spaces
########################## VALUES START ##########################
(?P<values>                             # named group: values
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
|                                       # or
array\s*\((?:[^()]|(?R))*\)             # match an array()
|                                       # or
\[(?:[^[\]]|(?R))*\]                    # match an array, new PHP array syntax: [1, 3, 5] is the same as array(1,3,5)
|                                       # or
(?:function\s+)?\w+\s*                  # match functions: helloWorld, function name
(?:\((?:[^()]|(?R))*\))                 # match function parameters (wut), (), (array(1,2,4))
(?:(?:\s*use\s*\((?:[^()]|(?R))*\)\s*)? # match use(&$var), use($foo, $bar) (optionally)
\{(?:[^{}]|(?R))*\}                     # match { whatever}
)?;?                                    # match ; (optionally)
)                                       # close group: values
########################## VALUES END ##########################
\s*                                     # white spaces

I've added some comments. Note that you need to use 3 modifiers:
x : allows whitespace and comments inside the pattern
s : makes the dot match newlines
i : matches case-insensitively
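
As a minimal sketch of what those three modifiers do, here is a much smaller pattern than the one above. The pattern and the `$subject` string are made up for illustration and are not part of the answer's regex.

```php
<?php
// Hypothetical pattern, only meant to show the x, s and i modifiers at work.
$pattern = '~
    array \s* \(    # the word array and an opening parenthesis (x allows this layout)
    (.*)            # everything inside, newlines included (needs s)
    \)              # the last closing parenthesis
~xsi';

// Upper-case keyword (needs i) and a body spread over several lines (needs s).
$subject = "ARRAY(\n  1,\n  2\n)";

$matched = preg_match($pattern, $subject, $m);

var_dump($matched); // 1 when the pattern matched
var_dump($m[1]);    // the text between the outer parentheses
```

Dropping any one of the three modifiers should make this match fail or make the pattern mean something else, which is why all three are required for the big regex too.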

PHP

$code='array(0  => "a", 123 => 123, $_POST["hello"][\'world\'] => array("is", "actually", "An array !"), 1234, \'got problem ?\', 
 "a" => $GlobalScopeVar, $test_further => function test($noway){echo "this works too !!!";}, "yellow" => "blue",
 "b" => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3)), "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
  "bug", "fixed", "mwahahahaa" => "Yeaaaah", "this is" => "a test"
);'; // Sample data

$code = preg_replace('#(^\s*array\s*\(\s*)|(\s*\)\s*;?\s*$)#s', '', $code); // Just to get rid of array( at the beginning, and ); at the end

preg_match_all('~
\s*                                     # white spaces
########################## KEYS START ##########################
(?:                                     # We\'ll use this to make keys optional
(?P<keys>                               # named group: keys
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
)                                       # close group: keys
########################## KEYS END ##########################
\s*                                     # white spaces
=>                                      # match =>
)?                                      # make keys optional
\s*                                     # white spaces
########################## VALUES START ##########################
(?P<values>                             # named group: values
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
|                                       # or
array\s*\((?:[^()]|(?R))*\)             # match an array()
|                                       # or
\[(?:[^[\]]|(?R))*\]                    # match an array, new PHP array syntax: [1, 3, 5] is the same as array(1,3,5)
|                                       # or
(?:function\s+)?\w+\s*                  # match functions: helloWorld, function name
(?:\((?:[^()]|(?R))*\))                 # match function parameters (wut), (), (array(1,2,4))
(?:(?:\s*use\s*\((?:[^()]|(?R))*\)\s*)? # match use(&$var), use($foo, $bar) (optionally)
\{(?:[^{}]|(?R))*\}                     # match { whatever}
)?;?                                    # match ; (optionally)
)                                       # close group: values
########################## VALUES END ##########################
\s*                                     # white spaces
~xsi', $code, $m); // Matching :p

print_r($m['keys']); // Print keys
print_r($m['values']); // Print values


// Some keys may be empty because they were not specified in the array, so let's fill them in!
foreach($m['keys'] as $index => &$key){
    if($key === ''){
        $key = 'made_up_index_'.$index;
    }
}
unset($key); // break the reference left over from the foreach
$results = array_combine($m['keys'], $m['values']);
print_r($results); // printing results

Output

Array
(
    [0] => 0
    [1] => 123
    [2] => $_POST["hello"]['world']
    [3] => 
    [4] => 
    [5] => "a"
    [6] => $test_further
    [7] => "yellow"
    [8] => "b"
    [9] => "c"
    [10] => 
    [11] => 
    [12] => "mwahahahaa"
    [13] => "this is"
)
Array
(
    [0] => "a"
    [1] => 123
    [2] => array("is", "actually", "An array !")
    [3] => 1234
    [4] => 'got problem ?'
    [5] => $GlobalScopeVar
    [6] => function test($noway){echo "this works too !!!";}
    [7] => "blue"
    [8] => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3))
    [9] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
    [10] => "bug"
    [11] => "fixed"
    [12] => "Yeaaaah"
    [13] => "a test"
)
Array
(
    [0] => "a"
    [123] => 123
    [$_POST["hello"]['world']] => array("is", "actually", "An array !")
    [made_up_index_3] => 1234
    [made_up_index_4] => 'got problem ?'
    ["a"] => $GlobalScopeVar
    [$test_further] => function test($noway){echo "this works too !!!";}
    ["yellow"] => "blue"
    ["b"] => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3))
    ["c"] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
    [made_up_index_10] => "bug"
    [made_up_index_11] => "fixed"
    ["mwahahahaa"] => "Yeaaaah"
    ["this is"] => "a test"
)


Known bug (fixed)

    $code='array("aaa", "sdsd" => "dsdsd");'; // fail
    $code='array(\'aaa\', \'sdsd\' => "dsdsd");'; // fail
    $code='array("aaa", \'sdsd\' => "dsdsd");'; // succeed
    // Which means, if a value with no keys is followed
    // by key => value and they are using the same quotation
    // then it will fail (first value gets merged with the key)


Credits

Credit goes to Bart Kiers for his recursive pattern to match nested brackets.
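
To show the recursive idea on its own, here is a small sketch that uses the same `(?R)` construct as the big regex, reduced to plain parentheses. The `$subject` string is made up for illustration.

```php
<?php
// One balanced (...) group: either a non-parenthesis character,
// or the whole pattern again via (?R) for each nested group.
$balanced = '~\((?:[^()]|(?R))*\)~';

// Hypothetical input with two top-level groups, the first one nested.
$subject = 'foo(1, bar(2, (3)), 4) + baz(5)';

preg_match_all($balanced, $subject, $m);

print_r($m[0]); // the top-level balanced groups, in order
```

Note that this sketch knows nothing about string literals, so a parenthesis inside quotes would still be counted. That is one reason the full regex matches quoted strings separately.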

Advice

You should probably go with a parser, since regexes like this one are fragile. @bwoebi has done a great job in his answer.

HamZa answered Oct 18 '22 04:10


Even though you asked for a regex, this also works with pure PHP. token_get_all is the key function here. For a regex, check out @HamZa's answer.

The advantage here is that it is more dynamic than a regex. A regex has a static pattern, while with token_get_all you can decide what to do after every single token. It even escapes single quotes and backslashes where necessary, which a regex wouldn't do.

Also, even a commented regex is hard to picture in your head; what the code does is much easier to follow when you read PHP code.
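
To get a feel for what token_get_all returns before reading the full script, here is a minimal sketch that just lists the tokens of a tiny array definition. The `$sample` string and the `$names` list are made up for illustration.

```php
<?php
// Hypothetical input; the single quotes keep $x from being interpolated.
$sample = '<?php array("a" => $x, 42);';

$names = [];
foreach (token_get_all($sample) as $token) {
    // Most tokens are arrays of [id, text, line];
    // single characters such as ( , ) ; come back as plain strings.
    $names[] = is_array($token) ? token_name($token[0]) : $token;
}

print_r($names);
```

The `=>` shows up as its own T_DOUBLE_ARROW token and the commas as plain strings, and those are the two things the script below reacts to.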

$code = ' array(

  0  => "a",
  "a" => $GlobalScopeVar,
  "b" => array("nested"=>array(1,2,3)),  
  "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
  "string_literal",
  12345

); ';

$token = token_get_all("<?php ".$code);
$newcode = "";

$i = 0;
while (++$i < count($token)) { // enter into array; then start.
        if (is_array($token[$i]))
                $newcode .= $token[$i][1];
        else
                $newcode .= $token[$i];

        if ($token[$i] == "(") {
                $ending = ")";
                break;
        }
        if ($token[$i] == "[") {
                $ending = "]";
                break;
        }
}

// init variables
$escape = 0;
$wait_for_non_whitespace = 0;
$parenthesis_count = 0;
$entry = "";

// main loop
while (++$i < count($token)) {
        // don't match commas in func($a, $b)
        if ($token[$i] == "(" || $token[$i] == "{") // ( -> normal parenthesis; { -> closures
                $parenthesis_count++;
        if ($token[$i] == ")" || $token[$i] == "}")
                $parenthesis_count--;

        // begin new string after T_DOUBLE_ARROW
        if (!$escape && $wait_for_non_whitespace && (!is_array($token[$i]) || $token[$i][0] != T_WHITESPACE)) {
                $escape = 1;
                $wait_for_non_whitespace = 0;
                $entry .= "'";
        }

        // here is a T_DOUBLE_ARROW, there will be a string after this
        if (is_array($token[$i]) && $token[$i][0] == T_DOUBLE_ARROW && !$escape) {
                $wait_for_non_whitespace = 1;
        }

        // entry ended: comma reached
        if (!$parenthesis_count && $token[$i] == "," || ($parenthesis_count == -1 && $token[$i] == ")" && $ending == ")") || ($ending == "]" && $token[$i] == "]")) {
                // go back to the first non-whitespace
                $whitespaces = "";
                if ($parenthesis_count == -1 || ($ending == "]" && $token[$i] == "]")) {
                        $cut_at = strlen($entry);
                        while ($cut_at && ord($entry[--$cut_at]) <= 0x20); // 0x20 == " "
                        $whitespaces = substr($entry, $cut_at + 1, strlen($entry));
                        $entry = substr($entry, 0, $cut_at + 1);
                }

                // $escape == true means: there was somewhere a T_DOUBLE_ARROW
                if ($escape) {
                        $escape = 0;
                        $newcode .= $entry."'";
                } else {
                        $newcode .= "'".addcslashes($entry, "'\\")."'";
                }

                $newcode .= $whitespaces.($parenthesis_count?")":(($ending == "]" && $token[$i] == "]")?"]":","));

                // reset
                $entry = "";
        } else {
                // add actual token to $entry
                if (is_array($token[$i])) {
                        $addChar = $token[$i][1];
                } else {
                        $addChar = $token[$i];
                }

                if ($entry == "" && is_array($token[$i]) && $token[$i][0] == T_WHITESPACE) {
                        $newcode .= $addChar;
                } else {
                        $entry .= $escape?str_replace(array("'", "\\"), array("\\'", "\\\\"), $addChar):$addChar;
                }
        }
}

//append remaining chars like whitespaces or ;
$newcode .= $entry;

print $newcode;

Demo at: http://3v4l.org/qe4Q1

Should output:

array(

  0  => '"a"',
  "a" => '$GlobalScopeVar',
  "b" => 'array("nested"=>array(1,2,3))',  
  "c" => 'function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }',
  '"string_literal"',
  '12345'

) 

To get the entries of the array, you can run print_r(eval("return $newcode;"));, which gives:

Array
(
    [0] => "a"
    [a] => $GlobalScopeVar
    [b] => array("nested"=>array(1,2,3))
    [c] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
    [1] => "string_literal"
    [2] => 12345
)
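
The escaping step can be checked in isolation. This is a minimal sketch, assuming a made-up `$fragment`: it wraps the fragment in a single-quoted literal with the same addcslashes call as the script above, then evaluates it back. As with any eval, only do this with code you trust.

```php
<?php
// Hypothetical fragment containing both a single quote and a backslash.
$fragment = <<<'PHP'
array('it\'s', "a\\b")
PHP;

// Escape ' and \ so the fragment survives inside a single-quoted PHP literal.
$literal = "'" . addcslashes($fragment, "'\\") . "'";

// Evaluating the literal should give back the original source text.
$roundTrip = eval("return $literal;");

var_dump($roundTrip === $fragment);
```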
bwoebi answered Oct 18 '22 03:10