
PHP: tokenizing, using a regular Expression (mostly there)

I want to tokenize formatting strings (very roughly like printf) and I think I am only missing a small bit:

  • %[number][one letter ctYymd] shall become a token²
  • $1...$10 shall become a token
  • all else (normal text) becomes a token.

I got quite far in a regex simulator, and the expression below looks like it should do the job.

²Update: now using # instead of %. (Less trouble with Windows command line parameters.)

It's not scary if you focus on the three parts connected by pipes (either-or), so basically it's just three alternatives. Since I want to match from start to end, I wrapped things in /^...$/ and surrounded them with a non-capturing group (?:...) that may repeat one or more times:

$exp = '/^(?:(%\\d*[ctYymd]+)|([^$%]+)|(\\$\\d))+$/'; 

Still my source doesn't deliver:

$exp = '/^(?:(%\\d*[ctYymd]+)|([^$%]+)|(\\$\\d))+$/';
echo "expression: $exp \n";

$tests = [
        '###%04d_Ball0n%02d$1',
        '%03d_Ball0n%02x$1%03d_Ball0n%02d$1',
        '%3d_Ball0n%02d',
    ];

foreach ( $tests as $test )
{
    echo "teststring: $test\n";
    if( preg_match( $exp, $test, $tokens) )
    {
        array_shift($tokens);
        foreach ( $tokens as $token )
            echo "\t\t'$token'\n";
    }
    else
        echo "not valid.";
} // foreach

I get results, but the matches are out of order. The first %[number][letter] never shows up, and other tokens appear twice:

expression: /^((%\d*[ctYymd]+)|([^$%]+)|(\$\d))+$/ 
teststring: ###%04d_Ball0n%02d$1
        '$1'
        '%02d'
        '_Ball0n'
        '$1'
teststring: %03d_Ball0n%02x$1%03d_Ball0n%02d$1
not valid.teststring: %3d_Ball0n%02d
        '%02d'
        '%02d'
        '_Ball0n'
teststring: %d_foobardoo
        '_foobardoo'
        '%d'
        '_foobardoo'
teststring: Ball0n%02dHamburg%d
        '%d'
        '%d'
        'Hamburg'
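
A likely explanation for the output above, as a minimal sketch: preg_match() reports each capture group only once, and a capture group that sits inside a repeated group keeps just the text of its last iteration. Earlier tokens such as the first %3d are overwritten, and the result is ordered by group number rather than by position in the string. The printed expression also shows an extra outer capturing group, which would account for the duplicated first entry. The variable names below are illustrative only.

```php
<?php
// The question's pattern in its (?:...) form, written with single backslashes.
$attempt = '/^(?:(%\d*[ctYymd]+)|([^$%]+)|(\$\d))+$/';
$subject = '%3d_Ball0n%02d';

// One overall match: every group appears once and holds its LAST iteration.
preg_match($attempt, $subject, $captures);
// $captures is ['%3d_Ball0n%02d', '%02d', '_Ball0n']; the leading '%3d' is gone.

// Dropping the anchors and the repetition lets preg_match_all() return
// every token as its own match, in string order.
preg_match_all('/%\d*[ctYymd]+|[^$%]+|\$\d/', $subject, $all);
// $all[0] is ['%3d', '_Ball0n', '%02d'].

var_export($captures);
echo "\n";
var_export($all[0]);
echo "\n";
```

This is why a repeated group is fine for validating a whole string but not for collecting its pieces.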
Frank Nocke asked Oct 31 '22 16:10


1 Answer

Solution (edited by OP): I use two slight variations of the following pattern, differing only in how it is wrapped: one for validation, then one for tokenizing:

#\d*[ctYymd]+|\$\d+|[^#\$]+


Code:

$core = '#\d*[ctYymd]+|\$\d+|[^#\$]+';
$expValidate = '/^('.$core.')+$/m';
$expTokenize = '/('.$core.')/m';

$tests = [
        '#3d-',
        '#3d-ABC',
        '***#04d_Ball0n#02d$1',
        '#03d_Ball0n#02x$AwrongDollar',
        '#3d_Ball0n#02d',
        'Badstring#02xWrongLetterX'
    ];

foreach ( $tests as $test )
{
    echo "teststring: [$test]\n";

    if( ! preg_match( $expValidate, $test) )
    {
        echo "not valid.\n";
        continue;
    }
    if( preg_match_all( $expTokenize, $test, $tokens) ) {
        foreach ( $tokens[0] as $token )
            echo "\t\t'$token'\n";
    }

} // foreach

Output:

teststring: [#3d-]
        '#3d'
        '-'
teststring: [#3d-ABC]
        '#3d'
        '-ABC'
teststring: [***#04d_Ball0n#02d$1]
        '***'
        '#04d'
        '_Ball0n'
        '#02d'
        '$1'
teststring: [#03d_Ball0n#02x$AwrongDollar]
not valid.
teststring: [#3d_Ball0n#02d]
        '#3d'
        '_Ball0n'
        '#02d'
teststring: [Badstring#02xWrongLetterX]
not valid.
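
The core pattern above is slightly more permissive than the spec in the question: [ctYymd]+ accepts several letters in a row, and \$\d+ accepts any number after the dollar sign. If the spec is meant literally (exactly one letter, and only $1 to $10), a stricter variant could look like the sketch below. The function name tokenizeStrict and the decision to reject $11 outright are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: validate first, then tokenize, using one shared core pattern.
function tokenizeStrict(string $format): ?array
{
    // #[digits][one letter]  |  $1..$10 not followed by another digit  |  plain text
    $core = '#\d*[ctYymd]|\$(?:10|[1-9])(?!\d)|[^#$]+';

    // Validation: the whole string must consist of tokens only.
    if (!preg_match('/^(?:' . $core . ')+$/', $format)) {
        return null; // not a valid format string
    }

    // Tokenizing: collect every token in order of appearance.
    preg_match_all('/' . $core . '/', $format, $matches);
    return $matches[0];
}

var_export(tokenizeStrict('***#04d_Ball0n#02d$1'));
// ['***', '#04d', '_Ball0n', '#02d', '$1']
echo "\n";
var_export(tokenizeStrict('#03d_Ball0n#02x$AwrongDollar'));
// NULL
echo "\n";
```

With this variant, '#3dd' splits into '#3d' followed by the text token 'd', whereas the original core pattern keeps '#3dd' as a single token.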
anubhava answered Nov 15 '22 05:11