Imagine a string like this:
field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2))
I would like to get an array like this:
array(
field1 => array(),
field2 => array(subfield1),
field3 => array(
subfield2,
subfield3
),
field4 => array(),
field5 => array(
subfield4 => array(
subsubfield => array(),
subsubfield => array()
)
)
)
I have this regex [a-zA-Z0-9]*\([^()]*(?:(?R)[^()]*)*\) which does some of the work outputting:
array(
field1,
field2(subfield1),
field3(subfield2,subfield3),
field4(),
field5(subfield4(subsubfield,subsubfield2))
)
Though this it not what I want. I'm kind of stuck now but the options I've come up with so far are:
For what it's worth, I've got to loop through the fields and subfields anyway. I've got some code that uses the given regex and runs the same match against its values when required later on. I want to parse the whole string including its nested substring at once.
Does anyone know how I get started on this? Which option would be the best (or better) approach? (readability vs resource usage vs complexity vs etc)
The problem you describe cannot be represented by a regular language, because regular languages cannot balance parentheses. However, most regular expression implementations have had features added to them over the years that allow for parsing of languages more sophisticated than regular languages. In particular, this problem could be solved with either .NET's balanced matching, or PCRE's recursive expressions (thanks, @Gumbo, for pointing that out in the comments).
However, just because you can do a thing doesn't mean you should. The problem with using a regular expression for this kind of task is that as you extend your metalanguage, it will get exponentially more difficult to modify your regular expression. Whereas parsers tend to be much more plastic and easy to extend.
So you could probably construct a series of regular expressions to cover non-pathological cases of your input, but why even try when you could just write a parser? They're easy to maintain, extremely fast (faster than regular expressions), easy to extend, and a lot of fun to boot.
I originally missed that this question was looking for a PHP solution, so I wrote it in JavaScript. I translated that into PHP, and left the original JavaScript solution at the end of the post.
function parse( $s ) {
// we will always have a "current context". the current context is the array we're
// currently operating in. when we start, this is simply an empty array. as new
// arrays are created, this context will change.
$context = array();
// since we have to keep track of how deep our context is, we keep a context stack
$contextStack = array(&$context);
// this accumulates the name of the current array
$name = '';
for( $i=0; $i<strlen($s); $i++ ) {
switch( $s[$i] ) {
case ',':
// if the last array hasn't been added to the current context
// (as will be the case for arrays lacking parens), we add it now
if( $name!='' && !array_key_exists( $name, $context ) )
$context[$name] = array();
// reset name accumulator
$name = '';
break;
case '(':
// we are entering a subcontext
// save a new array in the current context; this will become our new context
$context[$name] = array();
// switch context and add to context stack
$context = &$context[$name];
$contextStack[] = &$context;
// reset name accumulator
$name = '';
break;
case ')':
// we are exiting a context
// if we haven't saved this array in the current context, do so now
if( $name!='' && !array_key_exists( $name, $context ) )
$context[$name] = array();
// we can't just assign $context the return value of array_pop because
// that does not return a reference
array_pop($contextStack);
if( count($contextStack) == 0 ) throw new Exception( 'Unmatched parenthesis' );
$context = &$contextStack[count($contextStack)-1];
// reset name accumulator
$name = '';
break;
default:
// this is part of the field name
$name .= $s[$i];
}
}
// add any trailing arrays to the context (this will cover the case
// where our input ends in an array without parents)
if( $name!='' && !array_key_exists( $name, $context ) )
$context[$name] = array();
if( count( $contextStack ) != 1 ) throw new Exception( 'Unmatched parenthesis' );
return array_pop( $contextStack );
}
function parse(s) {
var root = { parent: null, children: [] };
var field = { parent: root, name: '', start_idx: 0, children: [] };
root.children.push( field );
for( var i=0; i<s.length; i++ ) {
switch( s[i] ) {
case ',':
// if this field didn't have any children, we have to set its text
if( !field.children.length )
field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
// start a new field; create new field and change context
var newfield = { parent: field.parent, name: '', start_idx: i, children:[] };
field.parent.children.push(newfield);
field = newfield;
break;
case '(':
// start of a subfield; create subfield and change context
var subfield = { parent: field, name: '', start_idx: i, children:[] };
field.children.push(subfield);
field = subfield;
break;
case ')':
// end of a subfield; fill out subfield details and change context
if( !field.parent ) throw new Error( 'Unmatched parenthesis!' );
field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
if( field.text==='()' ) {
// empty subfield; pop this subfield so it doesn't clutter the parent
field.parent.children.pop();
}
field = field.parent;
break;
default:
// this is part of the field name
field.name += s[i];
field.name = field.name.trim();
}
}
return root;
}
Now that we have a parsed tree of your language, we can pretty easily create some recursive code to emit your PHP:
function formatphp_namedarray(arr,indent,lastchild) {
var php = indent + arr.name + ' => array(';
if( arr.children.length ) {
if( arr.children.length===1 && arr.children[0].length===0 ) {
php += arr.children[0].name;
} else {
php += '\n';
indent += '\t';
for( var i=0; i<arr.children.length; i++ )
php += formatphp_namedarray(arr.children[i],indent,i===arr.children.length-1);
indent = indent.replace(/\t$/,'');
php += indent;
}
}
php += (lastchild?')':'),') + '\n';
return php;
}
function formatphp(t) {
var php = 'array(\n';
for( var i=0; i<t.children.length; i++ )
php += formatphp_namedarray( t.children[i], '\t', i===t.children.length-1 );
php += ')'
return php;
}
See it all working here: http://jsfiddle.net/6bguY/1/
Ok I think this does the trick:
$a = "field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2,limit:50,offset:0))";
$output = array();
$outputStacktrace = array(&$output);
$depth = 0;
$buffer = $key = '';
$m = memory_get_usage();
for ($i = 0; $i < strlen($a); $i++)
if ($a[$i] == ':') {
$key = $buffer;
$buffer = '';
} elseif ($a[$i] == ',') {
if (strlen($buffer))
$outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
$buffer = $key = '';
} elseif ($a[$i] == '(') {
$outputStacktrace[$depth][$buffer] = array();
$outputStacktrace[$depth + 1] = &$outputStacktrace[$depth][$buffer];
$depth++;
$buffer = '';
} elseif ($a[$i] == ')') {
if (strlen($buffer))
$outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
$buffer = $key = '';
unset($outputStacktrace[$depth]);
$depth--;
} else
$buffer .= $a[$i];
var_dump($output);
All whithin a single loop. Pretty happy with it. Comments are greatly appreciated!
Edit: including fixed variable names, fixed last non-nested character truncation, added 'special' query elements: limit & offset.
<?php
$output = array();
$outputStack = array(&$output);
$depth = 0;
$buffer = $key = '';
$s = count($a);
for ($i = 0; $i < $s; $i++)
if ($a[$i] == ':') {
$key = $buffer;
$buffer = '';
} elseif ($a[$i] == ',' || $a[$i] == ')' || $i == $s - 1) {
if ($depth < 4) {
if ($i == $s - 1)
$buffer .= $a[$i];
if (strlen($buffer))
if (strlen($key))
if (($key == 'limit' || $key == 'offset') && ((int) $buffer) . "" == $buffer)
$outputStack[$depth][$key] = $buffer;
else
$outputStack[$depth]['filters'][$key] = $buffer;
else
$outputStack[$depth]['fields'][] = $buffer;
}
$buffer = $key = '';
if ($a[$i] == ')') {
array_pop($outputStack);
$depth--;
}
} elseif ($a[$i] == '(') {
if ($depth + 1 < 4) {
$outputStack[$depth]['connections'][$buffer] = array('fields' => array('id'));
$outputStack[$depth + 1] = &$outputStack[$depth]['connections'][$buffer];
}
$depth++;
$buffer = '';
} else
$buffer .= $a[$i];
unset($outputStack, $depth, $buffer, $key, $a, $i);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With