I am very new to regex, and this is way too advanced for me. So I am asking the experts over here.
Problem I would like to retrieve the constants / values from a php define()
DEFINE('TEXT', 'VALUE');
Basically I would like a regex to be able to return the name of constant, and the value of constant from the above line. Just TEXT and VALUE . Is this even possible?
Why I need it? I am dealing with language file and I want to get all couples (name, value) and put them in array. I managed to do it with str_replace() and trim() etc.. but this way is long and I am sure it could be made easier with single line of regex.
Note: The VALUE may contain escaped single quotes as well. example:
DEFINE('TEXT', 'J\'ai');
I hope I am not asking for something too complicated. :)
Regards
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.
The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.
XML is not a regular language (that's a technical term) so you will never be able to parse it correctly using a regular expression.
For any kind of grammar-based parsing, regular expressions are usually an awful solution. Even smple grammars (like arithmetic) have nesting and it's on nesting (in particular) that regular expressions just fall over.
Fortunately PHP provides a far, far better solution for you by giving you access to the same lexical analyzer used by the PHP interpreter via the token_get_all() function. Give it a character stream of PHP code and it'll parse it into tokens ("lexemes"), which you can do a bit of simple parsing on with a pretty simple finite state machine.
Run this program (it's run as test.php so it tries it on itself). The file is deliberately formatted badly so you can see it handles that with ease.
<?
define('CONST1', 'value' );
define (CONST2, 'value2');
define( 'CONST3', time());
define('define', 'define');
define("test", VALUE4);
define('const5', //
'weird declaration'
) ;
define('CONST7', 3.14);
define ( /* comment */ 'foo', 'bar');
$defn = 'blah';
define($defn, 'foo');
define( 'CONST4', define('CONST5', 6));
header('Content-Type: text/plain');
$defines = array();
$state = 0;
$key = '';
$value = '';
$file = file_get_contents('test.php');
$tokens = token_get_all($file);
$token = reset($tokens);
while ($token) {
// dump($state, $token);
if (is_array($token)) {
if ($token[0] == T_WHITESPACE || $token[0] == T_COMMENT || $token[0] == T_DOC_COMMENT) {
// do nothing
} else if ($token[0] == T_STRING && strtolower($token[1]) == 'define') {
$state = 1;
} else if ($state == 2 && is_constant($token[0])) {
$key = $token[1];
$state = 3;
} else if ($state == 4 && is_constant($token[0])) {
$value = $token[1];
$state = 5;
}
} else {
$symbol = trim($token);
if ($symbol == '(' && $state == 1) {
$state = 2;
} else if ($symbol == ',' && $state == 3) {
$state = 4;
} else if ($symbol == ')' && $state == 5) {
$defines[strip($key)] = strip($value);
$state = 0;
}
}
$token = next($tokens);
}
foreach ($defines as $k => $v) {
echo "'$k' => '$v'\n";
}
function is_constant($token) {
return $token == T_CONSTANT_ENCAPSED_STRING || $token == T_STRING ||
$token == T_LNUMBER || $token == T_DNUMBER;
}
function dump($state, $token) {
if (is_array($token)) {
echo "$state: " . token_name($token[0]) . " [$token[1]] on line $token[2]\n";
} else {
echo "$state: Symbol '$token'\n";
}
}
function strip($value) {
return preg_replace('!^([\'"])(.*)\1$!', '$2', $value);
}
?>
Output:
'CONST1' => 'value'
'CONST2' => 'value2'
'CONST3' => 'time'
'define' => 'define'
'test' => 'VALUE4'
'const5' => 'weird declaration'
'CONST7' => '3.14'
'foo' => 'bar'
'CONST5' => '6'
This is basically a finite state machine that looks for the pattern:
function name ('define')
open parenthesis
constant
comma
constant
close parenthesis
in the lexical stream of a PHP source file and treats the two constants as a (name,value) pair. In doing so it handles nested define() statements (as per the results) and ignores whitespace and comments as well as working across multiple lines.
Note: I've deliberatley made it ignore the case when functions and variables are constant names or values but you can extend it to that as you wish.
It's also worth pointing out that PHP is quite forgiving when it comes to strings. They can be declared with single quotes, double quotes or (in certain circumstances) with no quotes at all. This can be (as pointed out by Gumbo) be an ambiguous reference reference to a constant and you have no way of knowing which it is (no guaranteed way anyway), giving you the chocie of:
Personally I would go for (1) then (3).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With