In PHP I have the following string:
$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
I need to split this string into the following parts:
AAA
BBB
(CCC,DDD)
'EEE'
'FFF,GGG'
('HHH','III')
(('JJJ','KKK'),LLL, (MMM,NNN))
OOO
I tried several regexes, but couldn't find a solution. Any ideas?
UPDATE
I've decided that a regex is not really the best solution when dealing with malformed data, escaped quotes, etc.
Thanks to suggestions made here, I found a function that parses the string character by character, which I rewrote to suit my needs. It can handle different kinds of brackets, and the separator and quote characters are parameters as well.
function explode_brackets($str, $separator=",", $leftbracket="(", $rightbracket=")", $quote="'", $ignore_escaped_quotes=true) {
    $buffer = '';
    $stack = array();
    $depth = 0;
    $betweenquotes = false;
    $char = ''; // initialised so that $previouschar is defined on the first iteration
    $len = strlen($str);
    for ($i=0; $i<$len; $i++) {
        $previouschar = $char;
        $char = $str[$i];
        switch ($char) {
            case $separator:
                // split only on separators outside quotes and outside brackets
                if (!$betweenquotes) {
                    if (!$depth) {
                        if ($buffer !== '') {
                            $stack[] = $buffer;
                            $buffer = '';
                        }
                        continue 2;
                    }
                }
                break;
            case $quote:
                // toggle the quoted state, optionally skipping backslash-escaped quotes
                if ($ignore_escaped_quotes) {
                    if ($previouschar!="\\") {
                        $betweenquotes = !$betweenquotes;
                    }
                } else {
                    $betweenquotes = !$betweenquotes;
                }
                break;
            case $leftbracket:
                if (!$betweenquotes) {
                    $depth++;
                }
                break;
            case $rightbracket:
                if (!$betweenquotes) {
                    if ($depth) {
                        $depth--;
                    } else {
                        // unmatched closing bracket: flush the buffer together with it
                        $stack[] = $buffer.$char;
                        $buffer = '';
                        continue 2;
                    }
                }
                break;
        }
        $buffer .= $char;
    }
    if ($buffer !== '') {
        $stack[] = $buffer;
    }
    return $stack;
}
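The core of this approach (track bracket depth and quote state, and split only at top-level separators) can be sketched in a condensed form. This is a minimal illustration, not a replacement for the function above: the name split_top_level is hypothetical, the brackets are hard-coded to ( and ), escaped quotes are not handled, and each token is trimmed, which explode_brackets does not do (it keeps the whitespace around each token, e.g. " BBB").

```php
<?php
// Hypothetical condensed variant of the depth-counting splitter.
function split_top_level(string $str, string $separator = ',', string $quote = "'"): array
{
    $tokens = [];
    $buffer = '';
    $depth = 0;          // current bracket nesting level
    $inQuotes = false;   // true while inside a quoted section
    $len = strlen($str);

    for ($i = 0; $i < $len; $i++) {
        $char = $str[$i];
        if ($char === $quote) {
            $inQuotes = !$inQuotes;
        } elseif (!$inQuotes && $char === '(') {
            $depth++;
        } elseif (!$inQuotes && $char === ')' && $depth > 0) {
            $depth--;
        } elseif (!$inQuotes && $depth === 0 && $char === $separator) {
            // top-level separator: emit the trimmed token and start a new one
            $token = trim($buffer);
            if ($token !== '') {
                $tokens[] = $token;
            }
            $buffer = '';
            continue;
        }
        $buffer .= $char;
    }

    $token = trim($buffer);
    if ($token !== '') {
        $tokens[] = $token;
    }
    return $tokens;
}

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
$tokens = split_top_level($str);
print_r($tokens);
```

For the sample string this yields the eight parts listed in the question, with the whitespace inside the nested bracket token left as it appears in the input.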
Instead of preg_split, use preg_match_all:
$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
preg_match_all("/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/", $str, $matches);
print_r($matches);
will print:
Array
(
    [0] => Array
        (
            [0] => AAA
            [1] => BBB
            [2] => (CCC,DDD)
            [3] => 'EEE'
            [4] => 'FFF,GGG'
            [5] => ('HHH','III')
            [6] => (('JJJ','KKK'), LLL, (MMM,NNN))
            [7] => OOO
        )

)
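One thing worth checking before relying on this pattern is how it behaves on malformed input. The sketch below uses two short test strings of my own (not from the question) and the same pattern. Because preg_match_all simply skips characters that no alternative can match, an unbalanced opening bracket is dropped silently rather than reported.

```php
<?php
$pattern = "/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/";

// Well-formed input: $matches[0] is the flat list of tokens.
preg_match_all($pattern, "AAA, (BBB,CCC), 'D,E'", $matches);
$wellFormed = $matches[0];
print_r($wellFormed);   // AAA, (BBB,CCC), 'D,E'

// Unbalanced input: the lone "(" matches nothing and is skipped,
// so the bracket contents come back as separate tokens.
preg_match_all($pattern, "AAA, (BBB, CCC", $matches);
$unbalanced = $matches[0];
print_r($unbalanced);   // AAA, BBB, CCC
```

If the input may be malformed, the token list alone is not enough to detect that; some additional validation is needed.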
The regex \((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+ can be divided into three parts:
- \((?:[^()]|(?R))+\), which matches balanced pairs of parentheses
- '[^']*', which matches a quoted string
- [^(),\s]+, which matches any character sequence not containing '(', ')', ',' or whitespace characters

A spartan regex that tokenizes and also validates all the tokens that it extracts:
\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)
Regex101
Put it in a PHP string literal, with delimiters:
'/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'
ideone
The result is in capturing group 1. In the example on ideone, I specify the PREG_OFFSET_CAPTURE flag, so that you can check against the last match in group 0 (the entire match) whether the entire source string has been consumed.
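The consumption check described above can be sketched as follows. The helper name tokenize_strict and the choice to return null on failure (including for empty input) are my own assumptions, not part of the ideone example; the pattern is the string literal shown earlier, unchanged.

```php
<?php
// Hypothetical helper: returns the list of tokens, or null when the
// pattern could not consume the whole input.
function tokenize_strict(string $str): ?array
{
    $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

    $count = preg_match_all($pattern, $str, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if (!$count) {
        return null;    // nothing matched (this sketch treats empty input as invalid)
    }

    // With PREG_OFFSET_CAPTURE every group is a [text, offset] pair.
    $last = $matches[$count - 1][0];            // group 0 of the final match
    if ($last[1] + strlen($last[0]) !== strlen($str)) {
        return null;    // matching stopped before the end of the string
    }

    $tokens = [];
    foreach ($matches as $m) {
        $tokens[] = $m[1][0];                   // capturing group 1 holds the token
    }
    return $tokens;
}

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
$tokens = tokenize_strict($str);
var_dump($tokens !== null);                     // bool(true)
var_dump(tokenize_strict("AAA, (BBB"));         // NULL: unbalanced bracket
```

Because the pattern is anchored with \G, matching stops at the first position where no valid token follows, so comparing the end of the last match with the string length is enough to detect invalid input.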
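The regex makes the following assumptions:
- A non-bracket, non-quoted token may not contain whitespace (\s). Consequently, it may not span multiple lines.
- A non-bracket, non-quoted token may not contain an opening bracket (, a closing bracket ), a single quote ' or a comma ,.
- A quoted token may not contain a single quote '.
- A quoted token may contain a comma ,
- A bracket token starts with ( and ends with ).
- An empty bracket token () is not allowed.
- Tokens are separated by a comma ,. A single trailing comma , is considered valid.
- Whitespace characters (\s, which includes the newline character) are arbitrarily allowed between tokens, the commas , separating tokens, and the brackets ( and ) of the bracket tokens.

The same regex in free-spacing form:
\G\s*+
(
  (
    \(
    (?:
        \s*+ (?2) \s*+ (?(?!\)),)
      |
        \s*+ [^()',\s]++ \s*+ (?(?!\)),)
      |
        \s*+ '[^'\r\n]*+' \s*+ (?(?!\)),)
    )++
    \)
  )
  |
  [^()',\s]++
  |
  '[^'\r\n]*+'
)
\s*+(?:,|$)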