In PHP I have the following string:
$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
I need to split this string into the following parts:
AAA
BBB
(CCC,DDD)
'EEE'
'FFF,GGG'
('HHH','III')
(('JJJ','KKK'),LLL, (MMM,NNN))
OOO
I tried several regexes, but couldn't find a solution. Any ideas?
UPDATE
I've decided that a regex is not really the best solution when dealing with malformed data, escaped quotes, etc.
Thanks to the suggestions made here, I found a function that uses parsing, which I rewrote to suit my needs. It can handle different kinds of brackets, and the separator and quote characters are parameters as well.
function explode_brackets($str, $separator=",", $leftbracket="(", $rightbracket=")", $quote="'", $ignore_escaped_quotes=true) {
    $buffer = '';
    $stack = array();
    $depth = 0;               // current bracket nesting level
    $betweenquotes = false;   // true while inside a quoted string
    $char = '';               // initialised so $previouschar is defined on the first iteration
    $len = strlen($str);
    for ($i=0; $i<$len; $i++) {
        $previouschar = $char;
        $char = $str[$i];
        switch ($char) {
            case $separator:
                // Split only on separators outside quotes and outside brackets
                if (!$betweenquotes) {
                    if (!$depth) {
                        if ($buffer !== '') {
                            $stack[] = $buffer;
                            $buffer = '';
                        }
                        continue 2;
                    }
                }
                break;
            case $quote:
                // Toggle the quote state, optionally skipping backslash-escaped quotes
                if ($ignore_escaped_quotes) {
                    if ($previouschar!="\\") {
                        $betweenquotes = !$betweenquotes;
                    }
                } else {
                    $betweenquotes = !$betweenquotes;
                }
                break;
            case $leftbracket:
                if (!$betweenquotes) {
                    $depth++;
                }
                break;
            case $rightbracket:
                if (!$betweenquotes) {
                    if ($depth) {
                        $depth--;
                    } else {
                        // Unmatched closing bracket: end the current piece here
                        $stack[] = $buffer.$char;
                        $buffer = '';
                        continue 2;
                    }
                }
                break;
        }
        $buffer .= $char;
    }
    // Flush whatever is left after the last separator
    if ($buffer !== '') {
        $stack[] = $buffer;
    }
    return $stack;
}
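The core of this parsing approach is just two pieces of state: the bracket nesting depth and an inside-quotes flag. The following is a stripped-down sketch of that idea, not the function above: `split_top_level` is a hypothetical name, it only handles round brackets, it does not handle escaped quotes, and (unlike the function above) it trims the whitespace around each piece.

```php
<?php
// Hypothetical helper illustrating depth and quote tracking.
// Assumptions: round brackets only, no escaped quotes, pieces are trimmed.
function split_top_level(string $str, string $sep = ',', string $quote = "'"): array
{
    $parts = [];
    $buffer = '';
    $depth = 0;          // how many brackets are currently open
    $inQuotes = false;   // true while inside a quoted string

    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $char = $str[$i];
        if ($char === $quote) {
            $inQuotes = !$inQuotes;
        } elseif (!$inQuotes && $char === '(') {
            $depth++;
        } elseif (!$inQuotes && $char === ')') {
            $depth = max(0, $depth - 1);
        } elseif (!$inQuotes && $depth === 0 && $char === $sep) {
            // A separator at the top level ends the current piece
            if (trim($buffer) !== '') {
                $parts[] = trim($buffer);
            }
            $buffer = '';
            continue;
        }
        $buffer .= $char;
    }
    if (trim($buffer) !== '') {
        $parts[] = trim($buffer);
    }
    return $parts;
}

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
print_r(split_top_level($str));
```

Trimming is a design choice of this sketch; drop the `trim()` calls if leading and trailing whitespace in each piece matters.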
Instead of a preg_split, do a preg_match_all:
$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
preg_match_all("/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/", $str, $matches);
print_r($matches);
will print:
Array
(
    [0] => Array
        (
            [0] => AAA
            [1] => BBB
            [2] => (CCC,DDD)
            [3] => 'EEE'
            [4] => 'FFF,GGG'
            [5] => ('HHH','III')
            [6] => (('JJJ','KKK'), LLL, (MMM,NNN))
            [7] => OOO
        )

)
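Because the pattern has no capturing groups, all tokens end up in `$matches[0]`. One caveat is that `preg_match_all` simply skips any character that none of the alternatives can match, so malformed input is not reported. The sketch below illustrates this with a made-up malformed string; `has_leftovers` is a hypothetical helper, and the check it performs (anything other than commas and whitespace remaining after the tokens are removed) is only one possible way to detect the problem.

```php
<?php
$pattern = "/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/";

// Made-up malformed input: the opening bracket is never closed.
$bad = "AAA, (BBB";
preg_match_all($pattern, $bad, $matches);
print_r($matches[0]); // AAA and BBB are returned; the stray "(" is silently dropped

// Hypothetical helper: strip every token, then see whether anything
// besides commas and whitespace is left over.
function has_leftovers(string $pattern, string $str): bool
{
    $rest = preg_replace($pattern, '', $str);
    return trim($rest, ", \t\r\n") !== '';
}

var_dump(has_leftovers($pattern, $bad));              // bool(true)
var_dump(has_leftovers($pattern, "AAA, (BBB), 'C'")); // bool(false)
```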
The regex \((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+ can be divided into three parts:
- \((?:[^()]|(?R))+\), which matches balanced pairs of parentheses
- '[^']*', which matches a quoted string
- [^(),\s]+, which matches any character sequence not containing '(', ')', ',' or whitespace

A spartan regex that tokenizes and also validates all the tokens that it extracts:
\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)
Regex101
Put into a PHP string literal, with delimiters:
'/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'
ideone
The result is in capturing group 1. In the example on ideone, I specify PREG_OFFSET_CAPTURE flag, so that you can check against the last match in group 0 (entire match) whether the entire source string has been consumed or not.
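The consumption check described above could look roughly like the sketch below. It is not the code from the ideone example: `tokenize_strict` and `TOKEN_RE` are hypothetical names, and returning `null` for input that was not fully consumed is a choice made here for illustration. The sample strings are made up.

```php
<?php
// The pattern from this answer, as a single-quoted PHP string literal.
const TOKEN_RE = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

// Hypothetical helper: returns the tokens, or null when the matches stop
// before the end of the input, i.e. the input did not validate.
function tokenize_strict(string $str): ?array
{
    preg_match_all(TOKEN_RE, $str, $sets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $tokens = [];
    $consumed = 0;
    foreach ($sets as $set) {
        $tokens[] = $set[1][0];                       // group 1: the token text
        $consumed = $set[0][1] + strlen($set[0][0]);  // end offset of the whole match
    }
    return $consumed === strlen($str) ? $tokens : null;
}

print_r(tokenize_strict("AAA, (BBB,'C,D'), 'EEE'")); // three tokens
var_dump(tokenize_strict("AAA, (BBB, CCC"));         // NULL: the bracket is never closed
```

Because of the leading `\G`, every match has to start where the previous one ended, so the end offset of the last match tells how far the validation got.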
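The regex makes the following assumptions:
- An unquoted, non-bracket token does not contain whitespace \s. Consequently, it may not span multiple lines.
- An unquoted, non-bracket token may not contain (, ), ' or ,.
- A quoted token may not span multiple lines and may not contain the quote character '.
- A quoted token may contain ,
- A bracket token starts with ( and ends with ).
- An empty bracket token () is not allowed.
- Tokens are separated by ,. A single trailing comma , is considered valid.
- Whitespace (as defined by \s, which includes the newline character) is arbitrarily allowed between tokens, the commas , separating tokens, and the brackets (, ) of bracket tokens.

The regex broken down: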
\G\s*+
(
  (
    \(
    (?:
        \s*+
        (?2)
        \s*+
        (?(?!\)),)
      |
        \s*+
        [^()',\s]++
        \s*+
        (?(?!\)),)
      |
        \s*+
        '[^'\r\n]*+'
        \s*+
        (?(?!\)),)
    )++
    \)
  )
  |
  [^()',\s]++
  |
  '[^'\r\n]*+'
)
\s*+(?:,|$)
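The broken-down form can also be used directly in PHP by adding the `x` (extended) modifier, which makes PCRE ignore whitespace outside character classes and treat `#` as the start of a comment. The sketch below is one way to write it, assuming a nowdoc for the pattern; the comments are added here and are not part of the original regex.

```php
<?php
// Same regex as above, in extended mode so it can carry comments.
$pattern = <<<'REGEX'
/\G\s*+
(                                             # group 1: one complete token
  (                                           # group 2: a bracket token
    \(
    (?:
        \s*+ (?2)          \s*+ (?(?!\)),)    # nested bracket token, by recursing into group 2
      | \s*+ [^()',\s]++   \s*+ (?(?!\)),)    # unquoted token
      | \s*+ '[^'\r\n]*+'  \s*+ (?(?!\)),)    # quoted token
    )++
    \)
  )
  | [^()',\s]++                               # unquoted token at the top level
  | '[^'\r\n]*+'                              # quoted token at the top level
)
\s*+(?:,|$)
/x
REGEX;

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO";
preg_match_all($pattern, $str, $m);
print_r($m[1]); // the tokens are in capturing group 1
```

The conditional `(?(?!\)),)` requires a comma after an inner token unless the next character is the closing bracket.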