
PHP Regex preg_match extraction

Although I know regex well enough in pseudocode, I'm having trouble translating what I want to do into PHP's Perl-compatible regex.
I'm trying to use preg_match to extract part of my expression.
I have the following string ${classA.methodA.methodB(classB.methodC(classB.methodD)))} and I need to do 2 things:

a. validate the syntax

  • ${classA.methodA.methodB(classB.methodC(classB.methodD)))} valid
  • ${classA.methodA.methodB} valid
  • ${classA.methodA.methodB()} not valid
  • ${methodB(methodC(classB.methodD)))} not valid

b. extract the following information: ${classA.methodA.methodB(classB.methodC(classB.methodD)))} should return

 1. classA
 2. methodA
 3. methodB(classB.methodC(classB.methodD)))

I've created this code

$expression = '${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}';
$pattern = '/\$\{(?:([a-zA-Z0-9]+)\.)(?:([a-zA-Z\d]+)\.)*([a-zA-Z\d.()]+)\}/';
if(preg_match($pattern, $expression, $matches))
{
    echo 'found'.'<br/>';
    for($i = 0; $i < count($matches); $i++)
        echo $i." ".$matches[$i].'<br/>';
}

The result is :
found
0 ${myvalue.fdsfs.fsdf.blo(fsdf.fsfds(fsfs.fs))}
1 myvalue
2 fsdf
3 blo(fsdf.fsfds(fsfs.fs))

Obviously I'm having difficulty extracting the repeated methods. It also doesn't validate properly (honestly, I left that for last, until I solve the other problem): empty parentheses are allowed, and it doesn't check that every opened parenthesis is closed.

Thanks all

UPDATE

@m.buettner

Thanks for your help. I gave your code a quick try and it has one very small issue, although I can work around it. It is the same issue I had with one of my earlier attempts that I didn't post here, and it shows up when I try this string:

$expression = '${myvalue.fdsfs}';

with your pattern it shows:

found
0 ${myvalue.fdsfs}
1 myvalue.fdsfs
2 myvalue
3 
4 fdsfs

As you can see, index 3 is captured as an empty string even though nothing is there. I couldn't understand why it does that, so can you suggest how to avoid it, or do I have to live with it because of PHP regex limits?

That said, I can only say thank you. Not only did you answer my problem, you also included as much information as possible, with many suggestions on the proper path to follow when developing patterns.

One last thing: I (stupidly) forgot to add one small but important case, which is multiple parameters separated by commas, so

$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB)}';
$expression = '${classA.methodAA(classB.methodBA(classC.methodCA),classC.methodCB,classD.mehtodDA)}';

must be valid.

I edited the pattern to this:

    $expressionPattern =             
        '/
        ^                   # beginning of the string
        [$][{]              # literal ${
        (                   # group 1, used for recursion
          (                 # group 2 (class name)
            [a-z\d]+        # one or more alphanumeric characters
          )                 # end of group 2 (class name)
          [.]               # literal .
          (                 # group 3 (all intermediate method names)
            (?:             # non-capturing group that matches a single method name
              [a-z\d]+      # one or more alphanumeric characters
              [.]           # literal .
            )*              # end of method name, repeat 0 or more times
          )                 # end of group 3 (intermediate method names);
          (                 # group 4 (final method name and arguments)
            [a-z\d]+        # one or more alphanumeric characters
            (?:             # non-capturing group for arguments
              [(]           # literal (
              (?1)          # recursively apply the pattern inside group 1
              (?:           # non-capturing group for multiple arguments
                [,]         # literal ,
                (?1)        # recursively apply the pattern inside group 1 on parameters
              )*            # end of multiple arguments group; repeat 0 or more times
              [)]           # literal )
            )?              # end of argument-group; make optional
          )                 # end of group 4 (method name and arguments)  
        )                   # end of group 1 (recursion group)
        [}]                 # literal }
        $                   # end of the string
        /ix';   

@Casimir et Hippolyte

Your suggestion is also good, but it makes things a little more complex to use. I mean, the code itself is easy to understand, but it becomes less flexible. That said, it also gave me a lot of information that will surely be helpful in the future.

@Denomales

Thanks for your support, but your code fails when I try this:

$sourcestring='${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))}';

the result is :

Array

( [0] => Array ( [0] => ${classA1.methodA0.methodA1.methodB1(classB.methodC(classB.methodD))} )

[1] => Array
    (
        [0] => classA1
    )

[2] => Array
    (
        [0] => methodA0
    )

[3] => Array
    (
        [0] => methodA1.methodB1(classB.methodC(classB.methodD))
    )

)

It should be

    [2] => Array
    (
        [0] => methodA0.methodA1
    )

[3] => Array
    (
        [0] => methodB1(classB.methodC(classB.methodD))
    )

)

or

[2] => Array
    (
        [0] => methodA0
    )

[3] => Array
    (
        [0] => methodA1
    )

[4] => Array
    (
        [0] => methodB1(classB.methodC(classB.methodD))
    )

)
user2463968 asked Jun 07 '13 15:06


3 Answers

This is a tough one. Recursive patterns are often beyond what's possible with regular expressions, and even where they are possible, they can lead to expressions that are very hard to understand and maintain.

You are using PHP and therefore PCRE, which indeed supports the recursive regex constructs (?n). As your recursive pattern is quite regular it is possible to find a somewhat practical solution using regex.

One caveat I should mention right away: since you allow an arbitrary number of "intermediate" method calls per level (in your snippet fdsfs and fsdf), you cannot get all of these in separate captures. That is simply impossible with PCRE. Each match will always yield the same finite number of captures, determined by the number of opening parentheses your pattern contains. If a capturing group is used repeatedly (e.g. using something like ([a-z]+\.)+), then every time the group is used the previous capture is overwritten and you only get the last instance. Therefore, I recommend that you capture all the "intermediate" method calls together, and then simply explode that result.

Likewise you couldn't (if you wanted to) get the captures of multiple nesting levels at once. Hence, your desired captures (where the last one includes all nesting levels) are the only option - you can then apply the pattern again to that last match to go a level further down.
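To make the overwriting behaviour concrete, here is a minimal sketch. The input string and variable names are made up for illustration and do not come from the question; it only shows the idea of capturing the whole repetition and then calling explode.

```php
<?php
// Hypothetical input, not taken from the question.
$input = 'abc.def.ghi.jkl';

// A repeated capturing group: each iteration overwrites the previous capture.
preg_match('/^(?:([a-z]+)\.)+[a-z]+$/', $input, $repeated);
// $repeated[1] now holds only the last iteration, "ghi".

// Capturing the whole repetition keeps every name in one string.
preg_match('/^((?:[a-z]+\.)+)[a-z]+$/', $input, $combined);
// $combined[1] is "abc.def.ghi." (note the trailing period).

// A limit of -1 makes explode() drop the empty last element
// that the trailing period would otherwise produce.
$names = explode('.', $combined[1], -1);
print_r($names);
```

The negative limit is the same trick used further down in this answer, so the trailing period never needs to be trimmed by hand.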

Now for the actual expression:

$pattern = '/
    ^                     # beginning of the string
    [$][{]                # literal ${
    (                     # group 1, used for recursion
      (                   # group 2 (class name)
        [a-z\d]+          # one or more alphanumeric characters
      )                   # end of group 2 (class name)
      [.]                 # literal .
      (                   # group 3 (all intermediate method names)
        (?:               # non-capturing group that matches a single method name
          [a-z\d]+        # one or more alphanumeric characters
          [.]             # literal .
        )*                # end of method name, repeat 0 or more times
      )                   # end of group 3 (intermediate method names);
      (                   # group 4 (final method name and arguments)
        [a-z\d]+          # one or more alphanumeric characters
        (?:               # non-capturing group for arguments
          [(]             # literal (
          (?1)            # recursively apply the pattern inside group 1
          [)]             # literal )
        )?                # end of argument-group; make optional
      )                   # end of group 4 (method name and arguments)  
    )                     # end of group 1 (recursion group)
    [}]                   # literal }
    $                     # end of the string
    /ix';

A few general notes: for complicated expressions (and in regex flavors that support it), always use the free-spacing x modifier which allows you to introduce whitespace and comments to format the expression to your desires. Without them, the pattern looks like this:

'/^[$][{](([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))[}]$/ix'

Even if you've written the regex yourself and you are the only one who ever works on the project - try understanding this a month from now.

Second, I've slightly simplified the pattern by using the case-insensitive i modifier. It simply removes some clutter, because you can omit the upper-case variants of your letters.

Third, note that I use single-character classes like [$] and [.] to escape characters where this is possible. That is simply a matter of taste, and you are free to use the backslash variants. I just personally prefer the readability of the character classes (and I know others here disagree), so I wanted to present you this option as well.

Fourth, I've added anchors around your pattern, so that there can be no invalid syntax outside of the ${...}.

Finally, how does the recursion work? (?n) is similar to a backreference \n, in that it refers to capturing group n (counted by opening parentheses from left to right). The difference is that a backreference tries to match again what was matched by group n, whereas (?n) applies the pattern again. That is, (.)\1 matches the same character twice in a row, whereas (.)(?1) matches any character and then applies the pattern again, hence matching another arbitrary character. If you use one of those (?n) constructs within the nth group, you get recursion. (?0) or (?R) refers to the entire pattern. That is all the magic there is.
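A small sketch of that difference, using toy patterns that are not part of the question. The balanced-parentheses pattern at the end is only an illustration of calling (?1) from inside group 1.

```php
<?php
// (.)\1 : the backreference must repeat the exact text captured by group 1.
$backrefSame  = preg_match('/^(.)\1$/', 'aa');   // 1
$backrefOther = preg_match('/^(.)\1$/', 'ab');   // 0

// (.)(?1) : the subroutine call re-applies the sub-pattern ".",
// so the second character may be anything.
$subSame  = preg_match('/^(.)(?1)$/', 'aa');     // 1
$subOther = preg_match('/^(.)(?1)$/', 'ab');     // 1

// Calling (?1) from inside group 1 gives recursion:
// an opening parenthesis, any number of nested groups, a closing parenthesis.
$balanced = '/^(\((?1)*\))$/';
$nestedOk  = preg_match($balanced, '(()())');    // 1
$nestedBad = preg_match($balanced, '(()');       // 0

var_dump($backrefOther, $subOther, $nestedOk, $nestedBad);
```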

The annotated ${...} pattern above, applied to the input

 '${abc.def.ghi.jkl(mno.pqr(stu.vwx))}'

will result in the captures

0 ${abc.def.ghi.jkl(mno.pqr(stu.vwx))}
1 abc.def.ghi.jkl(mno.pqr(stu.vwx))
2 abc
3 def.ghi.
4 jkl(mno.pqr(stu.vwx))

Note that there are a few differences to the outputs you actually expected:

0 is the entire match (and in this case just the input string again). PHP will always report this first, so you cannot get rid of it.

1 is the first capturing group which encloses the recursive part. You don't need this in the output, but (?n) unfortunately cannot refer to non-capturing groups, so you need this as well.

2 is the class name as desired.

3 is the list of intermediate method names, plus a trailing period. Using explode it's easy to extract all the method names from this.

4 is the final method name, with the optional (recursive) argument list. Now you could take this, and apply the pattern again if necessary. Note that for a completely recursive approach you might want to modify the pattern slightly. That is: strip off the ${ and } in a separate first step, so that the entire pattern has the exact same (recursive) pattern as the final capture, and you can use (?0) instead of (?1). Then match, remove method name, and parentheses, and repeat, until you get no more parentheses in the last capture.

For more information on recursion, have a look at PHP's PCRE documentation.


To illustrate my last point, here is a snippet that extracts all elements recursively:

if(!preg_match('/^[$][{](.*)[}]$/', $expression, $matches))
    echo 'Invalid syntax.';
else
    traverseExpression($matches[1]);

function traverseExpression($expression, $level = 0) {
    $pattern = '/^(([a-z\d]+)[.]((?:[a-z\d]+[.])*)([a-z\d]+(?:[(](?1)[)])?))$/i';
    if(preg_match($pattern, $expression, $matches)) {
        $indent = str_repeat(" ", 4*$level);
        echo $indent, "Class name: ", $matches[2], "<br />";
        foreach(explode(".", $matches[3], -1) as $method)
            echo $indent, "Method name: ", $method, "<br />";
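        // split the method name from its argument list at the first "(" only
        $parts = explode("(", $matches[4], 2);
        echo $indent, "Method name: ", $parts[0], "<br />";
        if(count($parts) > 1) {
            echo $indent, "With arguments:<br />";
            // drop the closing ")" so nested arguments such as pqr(stu.vwx) stay intact
            traverseExpression(substr($parts[1], 0, -1), $level+1);
        }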
    }
    else
    {
        echo 'Invalid syntax.';
    }
}

Note again, that I do not recommend using the pattern as a one-liner, but this answer is already long enough.

Martin Ender answered Oct 27 '22 17:10


You can do validation and extraction with the same pattern. Example:

$subjects = array(
'${classA.methodA.methodB(classB.methodC(classB.methodD))}',
'${classA.methodA.methodB}',
'${classA.methodA.methodB()}',
'${methodB(methodC(classB.methodD))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE)))}',
'${classA.methodA.methodB(classB.methodC(classB.methodD(classC.methodE())))}'
);

$pattern = <<<'LOD'
~
# definitions
(?(DEFINE)(?<vn>[a-z]\w*+))

# pattern
^\$\{
    (?<classA>\g<vn>)\.
    (?<methodA>\g<vn>)\.
    (?<methodB>
        \g<vn> ( 
            \( \g<vn> \. \g<vn> (?-1)?+ \)
        )?+
    )
}$

~x
LOD;

foreach($subjects as $subject) {
    echo "\n\nsubject: $subject";
    if (preg_match($pattern, $subject, $m))
        printf("\nclassA: %s\nmethodA: %s\nmethodB: %s",
            $m['classA'], $m['methodA'], $m['methodB']);
    else
        echo "\ninvalid string";    
}

Regex explanation:
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

At the end of the pattern you can see the x modifier, which allows spaces, newlines and comments inside the pattern.

The pattern begins with the definition of a named group vn (variable name). Here you define what a name such as classA or methodB looks like for the whole pattern. You can then refer to this definition anywhere in the pattern with \g<vn>.

Note that, if you want different kinds of names for classes and methods, you can add other definitions. Example:

(?(DEFINE)(?<cn>....))  # for class name
(?(DEFINE)(?<mn>....))  # for method name 
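As a rough sketch of how two such definitions could be filled in and used: the naming rules below (class names start with an upper-case letter, method names with a lower-case one) are purely an assumption for illustration and are not a requirement from the question.

```php
<?php
$pattern = '~
    (?(DEFINE)
        (?<cn> [A-Z]\w*+ )   # assumed rule: a class name starts with an upper-case letter
        (?<mn> [a-z]\w*+ )   # assumed rule: a method name starts with a lower-case letter
    )
    ^ \$\{ (?<class>\g<cn>) \. (?<method>\g<mn>) \} $
~x';

// Hypothetical subjects.
$okCount  = preg_match($pattern, '${Foo.bar}', $m);
$badCount = preg_match($pattern, '${foo.bar}');

if ($okCount) {
    printf("class: %s, method: %s\n", $m['class'], $m['method']);
}
```

The groups inside (?(DEFINE)...) never match on their own; they only serve as targets for \g<cn> and \g<mn>.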

The pattern itself:

(?<classA>\g<vn>) captures into the named group classA, using the pattern defined in vn.

The same goes for methodA.

methodB is different because it can contain nested parentheses, which is why I use a recursive pattern for this part.

Detail:

\g<vn>         # the method name (methodB)
(              # open a capture group
    \(         # literal opening parenthesis
    \g<vn> \. \g<vn> # for classB.methodC⑴
    (?-1)?+    # refer to the most recently opened capture group (the current one),
               # zero or one time (possessive), so the recursion can stop
               # when there are no more levels of parentheses
    \)         # literal closing parenthesis
)?+            # close the capture group,
               # zero or one time (possessive),
               # to allow methods without parameters

⑴ You can replace \g<vn> \. \g<vn> with \g<vn>(?>\.\g<vn>)+ if you want to allow more than one method.

About possessive quantifiers:

You can add + after a quantifier ( * + ? ) to make it possessive. The advantage is that the regex engine knows it doesn't have to backtrack into that subpattern to try other ways of matching, which makes the regex more efficient.
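A tiny sketch of the difference, with a toy pattern that is unrelated to the one above:

```php
<?php
// Greedy: \d+ first takes "12345", then gives one digit back
// so that the final \d can match.
$greedy = preg_match('/^\d+\d$/', '12345');        // 1

// Possessive: \d++ keeps all five digits and never gives any back,
// so nothing is left for the final \d and the match fails.
$possessive = preg_match('/^\d++\d$/', '12345');   // 0

var_dump($greedy, $possessive);
```

So a possessive quantifier is only safe where giving characters back could never help, as with \w*+ in front of a literal dot or parenthesis.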

Casimir et Hippolyte answered Oct 27 '22 19:10


Description

This expression will match and capture only ${classA.methodA.methodB(classB.methodC(classB.methodD)))} or ${classA.methodA.methodB} formats.

(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)


Groups

Group 0 gets the entire match, from the opening dollar sign to the closing curly bracket

  1. gets the class
  2. gets the first method
  3. gets the second method followed by all the text up to but not including the closing curly bracket. If this group contains empty round brackets (), the match will fail

PHP Code Example:

<?php
$sourcestring='${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
${classA2.methodA2.methodB2}
${classA3.methodA3.methodB3()}
${methodB4(methodC4(classB4.methodD)))}
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}'; // single quotes: "${" inside double quotes would start variable interpolation
preg_match_all('/(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

$matches Array:
(
    [0] => Array
        (
            [0] => ${classA1.methodA1.methodB1(classB.methodC(classB.methodD)))}
            [1] => 
${classA2.methodA2.methodB2}
            [2] => 
${classA5.methodA5.methodB5(classB.methodC(classB.methodD)))}
        )

    [1] => Array
        (
            [0] => classA1
            [1] => classA2
            [2] => classA5
        )

    [2] => Array
        (
            [0] => methodA1
            [1] => methodA2
            [2] => methodA5
        )

    [3] => Array
        (
            [0] => methodB1(classB.methodC(classB.methodD)))
            [1] => methodB2
            [2] => methodB5(classB.methodC(classB.methodD)))
        )

)

Disclaimers

  • I added a number to the end of the class and method names to help illustrate what's happening in the groups.
  • The sample text provided in the OP does not have balanced opening and closing round brackets.
  • Although () will be disallowed, (()) will be allowed.
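The last disclaimer can be checked with a short sketch. The pattern is copied from this answer; the two input strings are made up for the test.

```php
<?php
// Pattern copied from the answer above; the inputs are hypothetical.
$pattern = '/(?:^|\n|\r)[$][{]([^.(}]*)[.]([^.(}]*)[.]([^(}]*(?:[(][^}]+[)])?)[}](?=\n|\r|$)/im';

// Empty argument list: [^}]+ needs at least one character before the ")".
$emptyCall  = preg_match($pattern, '${classA.methodA.methodB()}');    // 0

// Doubled parentheses: the inner "()" satisfies [^}]+, so this slips through.
$doubleCall = preg_match($pattern, '${classA.methodA.methodB(())}');  // 1

var_dump($emptyCall, $doubleCall);
```

In other words, this pattern does not balance brackets; it only requires something between the first ( and the last ).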
Ro Yo Mi answered Oct 27 '22 17:10