Parsing plain text in such a way that will recognise a custom if statement

Question

I have the following string:

$string = "The man has {NUM_DOGS} dogs.";

I'm parsing this by running it through the following function:

function parse_text($string)
{
    global $num_dogs;

    $string = str_replace('{NUM_DOGS}', $num_dogs, $string);

    return $string;
}

parse_text($string);

Where $num_dogs is a preset variable. Depending on $num_dogs, this could return any of the following strings:

The man has 1 dogs.
The man has 2 dogs.
The man has 500 dogs.

The problem is that in the case that "the man has 1 dogs", dog is pluralised, which is undesired. I know that this could be solved simply by not using the parse_text function and instead doing something like:

if($num_dogs == 1){
    $string = "The man has 1 dog.";
}else{
    $string = "The man has $num_dogs dogs.";
}

But in my application I'm parsing more than just {NUM_DOGS} and it'd take a lot of lines to write all the conditions.

I need a shorthand way which I can write into the initial $string which I can run through a parser, which ideally wouldn't limit me to just two true/false possibilities.

For example, let

$string = 'The man has {NUM_DOGS} [{NUM_DOGS}|0=>"dogs",1=>"dog called fred",2=>"dogs called fred and harry",3=>"dogs called fred, harry and buster"].';

Is it clear what's happened at the end? I've attempted to initiate the creation of an array using the part inside the square brackets that's after the vertical bar, then compare the key of the new array with the parsed value of {NUM_DOGS} (which by now will be the $num_dogs variable at the left of the vertical bar), and return the value of the array entry with that key.

If that's not totally confusing, is it possible using the preg_* functions?

Leigh · Accepted Answer

The premise of your question is that you want to match a specific pattern and then replace it after performing additional processing on the matched text.

This seems like an ideal candidate for preg_replace_callback.

Regular expressions for capturing matched parentheses, quotes, braces and so on can become quite complicated, and doing it all with a single regular expression is quite inefficient. If you need that level of generality, you'd have to write a proper parser.

For this question I'm going to assume a limited level of complexity, and tackle it with a two stage parse using regex.

First of all, the simplest regex I can think of for capturing tokens between curly braces:

/{([^}]+)}/

Let's break that down.

{        # A literal opening brace
(        # Begin capture
  [^}]+  # Everything that's not a closing brace (one or more times)
)        # End capture
}        # Literal closing brace

When applied to a string with preg_match_all the results look something like:

array (
  0 => array (
    0 => '{TOK_ONE}',
    1 => '{TOK_TWO|0=>"no", 1=>"one", 2=>"two"}',
  ),
  1 => array (
    0 => 'TOK_ONE',
    1 => 'TOK_TWO|0=>"no", 1=>"one", 2=>"two"',
  ),
)

Looks good so far.
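
As a rough sketch of how that pattern could be applied, the snippet below wraps preg_match_all in a small helper. The function name find_tokens and the sample input are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper: list the raw token text found between curly braces.
function find_tokens(string $text): array
{
    // $m[0] holds the full matches (braces included),
    // $m[1] holds the captured contents without the braces.
    preg_match_all('/{([^}]+)}/', $text, $m);
    return $m[1];
}
```

Called on 'A string {TOK_ONE} with {TOK_TWO|0=>"no", 1=>"one", 2=>"two"}', this returns the two captured token strings, without their braces.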

Please note that if you have nested braces in your strings, i.e. {TOK_TWO|0=>"hi {x} y"}, this regex will not work. If this won't be a problem, skip down to the next section.

It is possible to do top-level matching, but the only way I have ever been able to do it is via recursion. Most regex veterans will tell you that as soon as you add recursion to a regex, it stops being a regex.

This is where the additional processing complexity kicks in, and with long complicated strings it's very easy to run out of stack space and crash your program. Use it carefully if you need to use it at all.

The recursive regex, taken from one of my other answers and modified a little:

/{((?:[^{}]*|(?R))*)}/

Broken down.

{                   # literal brace
(                   # begin capture
    (?:             # don't create another capture set
        [^{}]*      # everything not a brace
        |(?R)       # OR recurse
    )*              # none or more times
)                   # end capture
}                   # literal brace

And this time the output only matches top-level braces:

array (
  0 => array (
    0 => '{TOK_ONE|0=>"a {nested} brace"}',
  ),
  1 => array (
    0 => 'TOK_ONE|0=>"a {nested} brace"',
  ),
)

Again, don't use the recursive regex unless you have to. (Your system may not even support it if it has an old PCRE library.)
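
For illustration only, here is how the recursive pattern might be wrapped in a helper. The name find_top_level_tokens is hypothetical, and the sketch assumes balanced braces in the input; unbalanced input may backtrack heavily.

```php
<?php
// Hypothetical helper: return the contents of each top-level {...} group.
function find_top_level_tokens(string $text): array
{
    // (?R) recurses into the whole pattern, so inner braces are consumed
    // as part of the outer match instead of ending it early.
    preg_match_all('/{((?:[^{}]*|(?R))*)}/', $text, $m);
    return $m[1];
}
```

Nested braces stay inside the captured text, so any inner tokens would need a second pass if they are meant to be expanded too.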

With that out of the way, we need to work out whether the token has options associated with it. Instead of having two fragments to be matched as per your question, I'd recommend keeping the options with the token, as in my examples: {TOKEN|0=>"option"}

Let's assume $match contains a matched token. If we check for a pipe | and take the substring of everything after it, we're left with your list of options, which we can again parse with a regex. (Don't worry, I'll bring everything together at the end.)

/(\d+)\s*=>\s*"([^"]*)",?/

Broken down.

(\d+)    # Capture one or more decimal digits
\s*      # Any amount of whitespace (allows you to do 0    =>    "")
=>       # Literal pointy arrow
\s*      # Any amount of whitespace
"        # Literal quote
([^"]*)  # Capture anything that isn't a quote
"        # Literal quote
,?       # Maybe followed by a comma

And an example match

array (
  0 => array (
    0 => '0=>"no",',
    1 => '1 => "one",',
    2 => '2=>"two"',
  ),
  1 => array (
    0 => '0',
    1 => '1',
    2 => '2',
  ),
  2 => array (
    0 => 'no',
    1 => 'one',
    2 => 'two',
  ),
)
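
A minimal sketch of turning that match into a lookup array, assuming the quantifier sits inside the capture group, (\d+), so that multi-digit keys are captured whole. The helper name parse_options is an assumption for this example.

```php
<?php
// Hypothetical helper: turn '0=>"no", 1 => "one"' into [0 => 'no', 1 => 'one'].
function parse_options(string $options): array
{
    // Group 1 collects the keys, group 2 the quoted values.
    preg_match_all('/(\d+)\s*=>\s*"([^"]*)",?/', $options, $m);
    // array_combine pairs each key with its value; numeric string keys
    // become integer keys in the resulting array.
    return array_combine($m[1], $m[2]);
}
```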

If you want to use quotes inside your quotes, this pattern won't cope; you'll have to write a more involved regex that understands escaped quotes.
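
One possible way to allow quotes inside values is an alternation that accepts either an ordinary character or a backslash-escaped one. This is only a sketch: the backslash escape convention and the name parse_escaped_options are assumptions, since the answer does not define an escaping scheme.

```php
<?php
// Hypothetical variant of the option parser that accepts \" inside values.
function parse_escaped_options(string $options): array
{
    // Inside the quotes, accept either a character that is neither a quote
    // nor a backslash, or a backslash followed by any single character.
    $pattern = '/(\d+)\s*=>\s*"((?:[^"\\\\]|\\\\.)*)",?/';
    preg_match_all($pattern, $options, $m);
    // stripslashes() removes the escaping backslashes from each value.
    return array_combine($m[1], array_map('stripslashes', $m[2]));
}
```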

Wrapping up, here's a working example.

Some initialisation code.

$options = array(
    'WERE' => 1,
    'TYPE' => 'cat',
    'PLURAL' => 1,
    'NAME' => 2
);

$string = 'There {WERE|0=>"was a",1=>"were"} ' .
    '{TYPE}{PLURAL|1=>"s"} named bob' . 
    '{NAME|1=>" and bib",2=>" and alice"}';

And everything together.

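$string = preg_replace_callback('/{([^}]+)}/', function($match) use ($options) {
    // $match[1] is the text between the braces, e.g. 'WERE|0=>"was a",1=>"were"'
    $match = $match[1];

    // Split the token name from its option list, if it has one
    if (false !== $pipe = strpos($match, '|')) {
        $tokens = substr($match, $pipe + 1);
        $match = substr($match, 0, $pipe);
    } else {
        $tokens = array();
    }

    if (isset($options[$match])) {
        if ($tokens) {
            // Parse the key => "value" pairs; (\d+) captures multi-digit keys whole
            preg_match_all('/(\d+)\s*=>\s*"([^"]*)",?/', $tokens, $tokens);

            $tokens = array_combine($tokens[1], $tokens[2]);

            // Pick the option whose key equals the token's value
            return $tokens[$options[$match]];
        }
        // No options: substitute the value directly
        return $options[$match];
    }
    // Unknown token: replace it with nothing
    return '';
}, $string);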

Please note the error checking is minimal; there will be unexpected results if you pick options that don't exist.

There's probably a much simpler way to do all of this, but I just took the idea and ran with it.
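
If the missing-option case matters, a slightly more defensive variant might look like the sketch below. It rests on two assumptions that are not in the answer: unknown tokens are left in the text untouched, and a value with no matching option produces an empty string. The function name render is hypothetical.

```php
<?php
// Hypothetical defensive variant of the callback above.
function render(string $template, array $values): string
{
    return preg_replace_callback('/{([^}]+)}/', function ($m) use ($values) {
        // Split "NAME|options" into at most two parts; options may be absent.
        $parts = explode('|', $m[1], 2);
        $name = $parts[0];
        if (!array_key_exists($name, $values)) {
            return $m[0]; // assumption: keep unknown tokens as written
        }
        if (!isset($parts[1])) {
            return (string) $values[$name]; // plain {NAME} substitution
        }
        preg_match_all('/(\d+)\s*=>\s*"([^"]*)",?/', $parts[1], $opt);
        $choices = array_combine($opt[1], $opt[2]);
        // assumption: fall back to '' when no option matches the value
        return $choices[$values[$name]] ?? '';
    }, $template);
}
```

Leaving unknown tokens visible tends to make typos in token names easier to spot than silently dropping them.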

Mike · Answer

First of all, it is a bit debatable, but if you can easily avoid the global, just pass $num_dogs as an argument to the function, as most people believe global variables are evil!

Next, for getting the "s", I generally do something like this:
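
A sketch of what that might look like for the question's parse_text function, with the count passed in explicitly. The type declarations are an addition for this example.

```php
<?php
// The question's parse_text(), reworked to take the count as a parameter
// instead of reading a global variable.
function parse_text(string $string, int $num_dogs): string
{
    return str_replace('{NUM_DOGS}', (string) $num_dogs, $string);
}
```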

$dogs_plural = ($num_dogs == 1) ? '' : 's';

Then just do something like this:

$your_string = "The man has $num_dogs dog$dogs_plural";

It's essentially the same thing as doing an if/else block, but with fewer lines of code, and you only have to write the text once.

As for the other part, I am STILL confused about what you're trying to do, but I believe you are looking for some sort of way to convert

[{NUM_DOGS}|0=>"dogs",1=>"dog called fred",2=>"dogs called fred and harry",3=>"dogs called fred, harry and buster"]

into:

switch ($num_dogs) {
    case 0:
        return 'dogs';
    case 1:
        return 'dog called fred';
    case 2:
        return 'dogs called fred and harry';
    case 3:
        return 'dogs called fred, harry and buster';
}

The easiest way is to try to use a combination of explode() and regex to then get it to do something like I have above.
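
A rough sketch of that explode() plus regex combination, assuming {NUM_DOGS} has already been replaced by its value so the expression reads like '[2|0=>"dogs",...]'. The helper name choose_option and the empty-string fallback are assumptions.

```php
<?php
// Hypothetical helper: resolve one bracketed expression such as
// '[1|0=>"dogs",1=>"dog called fred"]'.
function choose_option(string $expr): string
{
    // Strip the surrounding brackets, then split the value from the options.
    // array_pad guards against input that has no pipe at all.
    $parts = array_pad(explode('|', trim($expr, '[]'), 2), 2, '');
    list($value, $options) = $parts;
    // A regex handles the pairs because option text may itself contain commas.
    preg_match_all('/(\d+)\s*=>\s*"([^"]*)"/', $options, $m);
    $map = array_combine($m[1], $m[2]);
    // assumption: return '' when the value has no matching option
    return $map[$value] ?? '';
}
```

Splitting the option list on commas with explode() alone would break on values like "dogs called fred, harry and buster", which is why the pairs are matched with a pattern instead.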

jmalloc · Answer

In a pinch, I have done something similar to what you're asking with an implementation vaguely like the code below.

This is nowhere near as feature rich as @Mike's answer, but it has done the trick in the past.

/**
 * This function pluralizes words, as appropriate.
 *
 * It is a completely naive, example-only implementation.
 * There are existing "inflector" implementations that do this
 * quite well for many/most *English* words.
 */
function pluralize($count, $word)
{
    if ($count === 1)
    {
        return $word;
    }
    return $word . 's';
}

/**
 * Matches template patterns in the following forms:
 *   {NAME}       - Replaces {NAME} with value from $values['NAME']
 *   {NAME:word}  - Replaces {NAME:word} with 'word', pluralized using the pluralize() function above.
 */
function parse($template, array $values)
{
    $callback = function ($matches) use ($values) {
        $number = $values[$matches['name']];
        if (array_key_exists('word', $matches)) {
            return pluralize($number, $matches['word']);
        }
        return $number;
    };

    $pattern = '/\{(?<name>.+?)(:(?<word>.+?))?\}/i';
    return preg_replace_callback($pattern, $callback, $template);
}

Here are some examples similar to your original question...

echo parse(
    'The man has {NUM_DOGS} {NUM_DOGS:dog}.' . PHP_EOL,
    array('NUM_DOGS' => 2)
);

echo parse(
    'The man has {NUM_DOGS} {NUM_DOGS:dog}.' . PHP_EOL,
    array('NUM_DOGS' => 1)
);

The output is:

The man has 2 dogs.

The man has 1 dog.

It may be worth mentioning that in larger projects I've invariably ended up ditching any custom-rolled inflection in favour of GNU gettext, which seems to be the most sane way forward once multilingual support is a requirement.
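
As a hedged illustration of the gettext route, PHP's gettext extension provides ngettext() for choosing between singular and plural message forms. The helper name dogs_message, the message strings, and the fallback branch for installations without the extension are all assumptions made for this sketch.

```php
<?php
// Hypothetical helper; the message IDs are examples, not from a real catalog.
function dogs_message(int $count): string
{
    if (function_exists('ngettext')) {
        // With no translation catalog bound, ngettext() returns the first
        // string when $count is 1 and the second otherwise.
        $format = ngettext('The man has %d dog.', 'The man has %d dogs.', $count);
    } else {
        // Assumed fallback for installations without the gettext extension.
        $format = $count === 1 ? 'The man has %d dog.' : 'The man has %d dogs.';
    }
    return sprintf($format, $count);
}
```

With a translation catalog in place, the plural rules of the target language decide which form is used, which is the part that hand-rolled pluralisation tends to get wrong.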
