Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Explode by commas recursively based on parentheses

Tags:

regex

php

Imagine a string like this:

field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2))

I would like to get an array like this:

array(
    field1 => array(),
    field2 => array(subfield1),
    field3 => array(
        subfield2,
        subfield3
    ),
    field4 => array(),
    field5 => array(
        subfield4 => array(
            subsubfield => array(),
            subsubfield => array()
        )
    )
)

I have this regex [a-zA-Z0-9]*\([^()]*(?:(?R)[^()]*)*\) which does some of the work outputting:

array(
    field1,
    field2(subfield1),
    field3(subfield2,subfield3),
    field4(),
    field5(subfield4(subsubfield,subsubfield2))
)

Though this it not what I want. I'm kind of stuck now but the options I've come up with so far are:

  1. Do something with preg_replace_callback
  2. Write some kind of custom parser for these kinds of hierarchical comma separated string
  3. Quit, get drunk and crack this case tomorrow (ok, not an option, gotta do this now)

For what it's worth, I've got to loop through the fields and subfields anyway. I've got some code that uses the given regex and runs the same match against its values when required later on. I want to parse the whole string including its nested substring at once.

Does anyone know how I get started on this? Which option would be the best (or better) approach? (readability vs resource usage vs complexity vs etc)

like image 283
Rebel Designer Avatar asked Mar 06 '26 12:03

Rebel Designer


2 Answers

Introduction

The problem you describe cannot be represented by a regular language, because regular languages cannot balance parentheses. However, most regular expression implementations have had features added to them over the years that allow for parsing of languages more sophisticated than regular languages. In particular, this problem could be solved with either .NET's balanced matching, or PCRE's recursive expressions (thanks, @Gumbo, for pointing that out in the comments).

However, just because you can do a thing doesn't mean you should. The problem with using a regular expression for this kind of task is that as you extend your metalanguage, it will get exponentially more difficult to modify your regular expression. Whereas parsers tend to be much more plastic and easy to extend.

So you could probably construct a series of regular expressions to cover non-pathological cases of your input, but why even try when you could just write a parser? They're easy to maintain, extremely fast (faster than regular expressions), easy to extend, and a lot of fun to boot.

I originally missed that this question was looking for a PHP solution, so I wrote it in JavaScript. I translated that into PHP, and left the original JavaScript solution at the end of the post.

PHP Solution

function parse( $s ) {
    // we will always have a "current context".  the current context is the array we're
    // currently operating in.  when we start, this is simply an empty array.  as new
    // arrays are created, this context will change.
    $context = array();
    // since we have to keep track of how deep our context is, we keep a context stack
    $contextStack = array(&$context);
    // this accumulates the name of the current array
    $name = '';
    for( $i=0; $i<strlen($s); $i++ ) {
        switch( $s[$i] ) {
            case ',':
                // if the last array hasn't been added to the current context
                // (as will be the case for arrays lacking parens), we add it now
                if( $name!='' && !array_key_exists( $name, $context ) )
                    $context[$name] = array();
                // reset name accumulator
                $name = '';
                break;
            case '(':
                // we are entering a subcontext

                // save a new array in the current context; this will become our new context
                $context[$name] = array();
                // switch context and add to context stack
                $context = &$context[$name];
                $contextStack[] = &$context;
                // reset name accumulator
                $name = '';
                break;
            case ')':
                // we are exiting a context

                // if we haven't saved this array in the current context, do so now
                if( $name!='' && !array_key_exists( $name, $context ) )
                    $context[$name] = array();
                // we can't just assign $context the return value of array_pop because
                // that does not return a reference
                array_pop($contextStack);
                if( count($contextStack) == 0 ) throw new Exception( 'Unmatched parenthesis' );
                $context = &$contextStack[count($contextStack)-1];
                // reset name accumulator
                $name = '';
                break;
            default:
                // this is part of the field name
                $name .= $s[$i];
        }
    }
    // add any trailing arrays to the context (this will cover the case
    // where our input ends in an array without parents)
    if( $name!='' && !array_key_exists( $name, $context ) )
        $context[$name] = array();
    if( count( $contextStack ) != 1 ) throw new Exception( 'Unmatched parenthesis' );
    return array_pop( $contextStack );
}

Original JavaScript Solution

function parse(s) {
    var root = { parent: null, children: [] };
    var field = { parent: root, name: '', start_idx: 0, children: [] };
    root.children.push( field );
    for( var i=0; i<s.length; i++ ) {
        switch( s[i] ) {
            case ',':
                // if this field didn't have any children, we have to set its text
                if( !field.children.length )
                    field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
                // start a new field; create new field and change context
                var newfield = { parent: field.parent, name: '', start_idx: i, children:[] };
                field.parent.children.push(newfield);
                field = newfield;
                break;
            case '(':
                // start of a subfield; create subfield and change context
                var subfield = { parent: field, name: '', start_idx: i, children:[] };
                field.children.push(subfield);
                field = subfield;
                break;
            case ')':
                // end of a subfield; fill out subfield details and change context
                if( !field.parent ) throw new Error( 'Unmatched parenthesis!' );
                field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
                if( field.text==='()' ) {
                    // empty subfield; pop this subfield so it doesn't clutter the parent
                    field.parent.children.pop();
                }
                field = field.parent;
                break;
            default:
                // this is part of the field name
                field.name += s[i];
                field.name = field.name.trim();
        }
    }
    return root;
}

Now that we have a parsed tree of your language, we can pretty easily create some recursive code to emit your PHP:

function formatphp_namedarray(arr,indent,lastchild) {
    var php = indent + arr.name + ' => array(';
    if( arr.children.length ) {
        if( arr.children.length===1 && arr.children[0].length===0 ) {
            php += arr.children[0].name;
        } else {
            php += '\n';
            indent += '\t';
            for( var i=0; i<arr.children.length; i++ )
                php += formatphp_namedarray(arr.children[i],indent,i===arr.children.length-1);
            indent = indent.replace(/\t$/,'');
            php += indent;
        }
    }
    php += (lastchild?')':'),') + '\n';
    return php;
}

function formatphp(t) {
    var php = 'array(\n';
    for( var i=0; i<t.children.length; i++ )
        php += formatphp_namedarray( t.children[i], '\t', i===t.children.length-1 );
    php += ')'
    return php;
}       

See it all working here: http://jsfiddle.net/6bguY/1/

like image 140
Ethan Brown Avatar answered Mar 08 '26 04:03

Ethan Brown


Ok I think this does the trick:

    $a = "field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2,limit:50,offset:0))";
    $output = array();
    $outputStacktrace = array(&$output);
    $depth = 0;
    $buffer = $key = '';
    $m = memory_get_usage();
    for ($i = 0; $i < strlen($a); $i++)
        if ($a[$i] == ':') {
            $key = $buffer;
            $buffer = '';
        } elseif ($a[$i] == ',') {
            if (strlen($buffer))
                $outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
            $buffer = $key = '';
        } elseif ($a[$i] == '(') {
            $outputStacktrace[$depth][$buffer] = array();
            $outputStacktrace[$depth + 1] = &$outputStacktrace[$depth][$buffer];
            $depth++;
            $buffer = '';
        } elseif ($a[$i] == ')') {
            if (strlen($buffer))
                $outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
            $buffer = $key = '';
            unset($outputStacktrace[$depth]);
            $depth--;
        } else
            $buffer .= $a[$i];
    var_dump($output);

All whithin a single loop. Pretty happy with it. Comments are greatly appreciated!

Edit: including fixed variable names, fixed last non-nested character truncation, added 'special' query elements: limit & offset.

<?php
    $output = array();
    $outputStack = array(&$output);
    $depth = 0;
    $buffer = $key = '';
    $s = count($a);
    for ($i = 0; $i < $s; $i++)
        if ($a[$i] == ':') {
            $key = $buffer;
            $buffer = '';
        } elseif ($a[$i] == ',' || $a[$i] == ')' || $i == $s - 1) {
            if ($depth < 4) {
                if ($i == $s - 1)
                    $buffer .= $a[$i];

                if (strlen($buffer))
                    if (strlen($key))
                        if (($key == 'limit' || $key == 'offset') && ((int) $buffer) . "" == $buffer)
                            $outputStack[$depth][$key] = $buffer;
                        else
                            $outputStack[$depth]['filters'][$key] = $buffer;
                    else
                        $outputStack[$depth]['fields'][] = $buffer;
            }

            $buffer = $key = '';

            if ($a[$i] == ')') {
                array_pop($outputStack);
                $depth--;
            }
        } elseif ($a[$i] == '(') {
            if ($depth + 1 < 4) {
                $outputStack[$depth]['connections'][$buffer] = array('fields' => array('id'));
                $outputStack[$depth + 1] = &$outputStack[$depth]['connections'][$buffer];
            }
            $depth++;
            $buffer = '';
        } else
            $buffer .= $a[$i];
    unset($outputStack, $depth, $buffer, $key, $a, $i);
like image 45
Rebel Designer Avatar answered Mar 08 '26 02:03

Rebel Designer



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!