Is there something like a counter variable in regular expression replace?

FMTEYEWTK (Far More Than Everything You've Ever Wanted to Know) about Fancy Regexes

Ok, I’m going to go from the simple to the sublime. Enjoy!

Simple s///e Solution

Given this:

#!/usr/bin/perl

$_ = <<"End_of_G&S";
    This particularly rapid,
        unintelligible patter
    isn't generally heard,
        and if it is it doesn't matter!
End_of_G&S

my $count = 0;

Then this:

s{
    \b ( [\w']+ ) \b
}{
    sprintf "(%s)[%d]", $1, ++$count;
}gsex;

produces this:

(This)[1] (particularly)[2] (rapid)[3],
    (unintelligible)[4] (patter)[5]
(isn't)[6] (generally)[7] (heard)[8], 
    (and)[9] (if)[10] (it)[11] (is)[12] (it)[13] (doesn't)[14] (matter)[15]!

Interpolated Code in Anon Array Solution

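Each of the remaining examples in this section starts over with the original text in $_ and with $count reset to 0. Whereas this: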

s/\b([\w']+)\b/#@{[++$count]}=$1/g;

produces this:

#1=This #2=particularly #3=rapid,
    #4=unintelligible #5=patter
#6=isn't #7=generally #8=heard, 
    #9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!
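The trick in that replacement is the @{[ ... ]} construct: the square brackets build an anonymous array from whatever expression is inside, and @{ ... } dereferences it straight away, so the result interpolates like any other array in a double-quoted string (which is what the right-hand side of s/// is without /e). Here is a minimal sketch of the idiom on its own, outside any regex; the variable names $n and $label are just placeholders of mine:

```perl
use strict;
use warnings;

my $n = 0;

# [ ++$n ] is an anonymous array holding the incremented value;
# @{ ... } dereferences it so that it interpolates into the string.
my $label = "first=@{[ ++$n ]} second=@{[ ++$n ]}";

print "$label\n";    # first=1 second=2
```

The same idiom works for any expression, though once the code gets longer than a line, s///e is usually easier to read.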

Solution with code in LHS instead of RHS

This puts the incrementation within the match itself:

s/ \b ( [\w']+ ) \b (?{ $count++ }) /#$count=$1/gx;

yields this:

#1=This #2=particularly #3=rapid,
    #4=unintelligible #5=patter
#6=isn't #7=generally #8=heard, 
    #9=and #10=if #11=it #12=is #13=it #14=doesn't #15=matter!

A Stuttering Stuttering Solution Solution Solution

This

s{ \b ( [\w'] + ) \b             }
 { join " " => ($1) x ++$count   }gsex;

generates this delightful answer:

This particularly particularly rapid rapid rapid,
    unintelligible unintelligible unintelligible unintelligible patter patter patter patter patter
isn't isn't isn't isn't isn't isn't generally generally generally generally generally generally generally heard heard heard heard heard heard heard heard, 
    and and and and and and and and and if if if if if if if if if if it it it it it it it it it it it is is is is is is is is is is is is it it it it it it it it it it it it it doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't doesn't matter matter matter matter matter matter matter matter matter matter matter matter matter matter matter!

Exploring Boundaries

There are more robust approaches to word boundaries that work for plural possessives (the previous approaches don’t), but I suspect your mystery lies in getting the ++$count to fire, not with the subtleties of \b behavior.

I really wish people understood that \b isn’t what they think it is. They always think it means there's white space or the edge of the string there. They never think of it as \w\W or \W\w transitions.

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

As you see, it's conditional depending on what it's touching. That’s what the (?(COND)THEN|ELSE) clause is for.
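If you want to see for yourself where \b fires, one way is to drop a visible marker at every boundary. This is only a small sketch; mark_boundaries is a made-up helper name, not anything built in:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the text with "|" inserted
# at every position where \b matches.
sub mark_boundaries {
    my ($text) = @_;
    (my $marked = $text) =~ s/\b/|/g;
    return $marked;
}

print mark_boundaries("parents' summer"), "\n";    # |parents|' |summer|
print mark_boundaries("isn't"), "\n";              # |isn|'|t|
```

Notice that there is no marker between the apostrophe and the space in the first string: both are \W characters, so there is no transition there, and a \b placed after parents' cannot match.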

This becomes an issue with things like:

$_ = qq('Tis Paul's parents' summer-house, isn't it?\n);
my $count = 0;

s{
    (?(?=[\-\w']) (?<![\-\w'])  | (?<![^\-\w']) )
    ( [\-\w'] + )
    (?(?<=[\-\w']) (?![\-\w'])  | (?![^\-\w'])  )
}{
    sprintf "(%s)[%d]", $1, ++$count
}gsex;

print;

which correctly prints:

('Tis)[1] (Paul's)[2] (parents')[3] (summer-house)[4], (isn't)[5] (it)[6]?
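For contrast, here is a sketch of what the plain \b version from the first section does with that same string. The variable names $text and $naive are mine, and I work on a copy so the original stays intact:

```perl
use strict;
use warnings;

my $text  = qq('Tis Paul's parents' summer-house, isn't it?);
my $count = 0;

# Same simple pattern as in the first solution, applied to a copy.
(my $naive = $text) =~ s{
    \b ( [\w']+ ) \b
}{
    sprintf "(%s)[%d]", $1, ++$count
}gex;

print "$naive\n";
# '(Tis)[1] (Paul's)[2] (parents)[3]' (summer)[4]-(house)[5], (isn't)[6] (it)[7]?
```

The leading apostrophe of 'Tis and the trailing one of parents' are left outside the match, and summer-house is split in two, so the count comes out as 7 instead of 6.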

Worrying about Unicode

1960s-style ASCII is about 50 years out of date. Just as [a-z] is nearly always wrong whenever you see anyone write it, it turns out that literal dashes and quotation marks shouldn’t show up in patterns, either. While we’re at it, you probably don’t want to use \w, because that matches digits and underscores as well, not just alphabetics.
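A quick way to convince yourself is to test a few characters against \w and against the Unicode properties used further down. This is just a sketch; the sample code points and the %is hash are my own choices for illustration:

```perl
use strict;
use warnings;

# a, 7, _, ASCII apostrophe, U+2019, ASCII hyphen-minus, U+2010.
# Only code points are printed, so no output encoding is needed.
my @samples = (0x61, 0x37, 0x5F, 0x27, 0x2019, 0x2D, 0x2010);

my %is;    # $is{code point}{test name} is 1 or 0
for my $cp (@samples) {
    my $ch = chr $cp;
    $is{$cp} = {
        w     => ($ch =~ /\w/                 ? 1 : 0),
        alpha => ($ch =~ /\p{Alphabetic}/     ? 1 : 0),
        quote => ($ch =~ /\p{Quotation_Mark}/ ? 1 : 0),
        dash  => ($ch =~ /\p{Dash}/           ? 1 : 0),
    };
    printf "U+%04X  \\w=%d  Alphabetic=%d  Quotation_Mark=%d  Dash=%d\n",
        $cp, @{ $is{$cp} }{qw(w alpha quote dash)};
}
```

The digit and the underscore match \w but not \p{Alphabetic}, while both the ASCII and the typographic apostrophe have the Quotation_Mark property, and both the ASCII hyphen-minus and U+2010 have the Dash property.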

Imagine this string:

$_ = qq(\x{2019}Tis Ren\x{E9}e\x{2019}s great\x{2010}grandparents\x{2019} summer\x{2010}house, isn\x{2019}t it?\n);

which you could have as a literal with use utf8:

use utf8;
$_ = qq(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?\n);

This time I’ll go at the pattern a bit differently, separating out my definition of terms from their execution to try to make it more readable and thence maintainable:

#!/usr/bin/perl -l
use 5.10.0;
use utf8;
use open qw< :std :utf8 >;
use strict;
use warnings qw< FATAL all >;
use autodie;

$_ = q(’Tis Renée’s great‐grandparents’ summer‐house, isn’t it?);

my $count = 0;

s{ (?<WORD> (?&full_word)  )

   # the rest is just definition
   (?(DEFINE)

     (?<word_char>   [\p{Alphabetic}\p{Quotation_Mark}] )

     (?<full_word>

             # next line won't compile cause
             # fears variable-width lookbehind
             ####  (?<! (?&word_char) )   )
             # so must inline it

         (?<! [\p{Alphabetic}\p{Quotation_Mark}] )

         (?&word_char)
         (?:
             \p{Dash}
           | (?&word_char)
         ) *

         (?!  (?&word_char) )
     )

   )   # end DEFINE declaration block

}{
    sprintf "(%s)[%d]", $+{WORD}, ++$count;
}gsex;

print;

That code when run produces this:

(’Tis)[1] (Renée’s)[2] (great‐grandparents’)[3] (summer‐house)[4], (isn’t)[5] (it)[6]?

Ok, so that may have been FMTEYEWTK about fancy regexes, but aren’t you glad you asked? ☺
