Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I highlight consecutive duplicate words with a Perl regular expression?

Tags:

regex

perl

I want a Perl regular expression that will match duplicated words in a string.

Given the following input:

$str = "Thus joyful Troy Troy maintained the the watch of night..."

I would like the following output:

Thus joyful [Troy Troy] maintained [the the] watch of night...
like image 285
muruga Avatar asked Mar 24 '10 03:03

muruga


1 Answers

This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication:

 $str = 'This is Goethe the the the their sentence';

 $str =~ s/\b((\w+)(?:\s+\2\b)+)/[\1]/g;

The features I'm about to use are described in either perlre, when they apply at a pattern, or perlop when they affect how the substitution operator does its work.

If you like the /x flag to add insignificant whitespace and comments:

 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \2
          \b
         )+
      )
     /[\1]/xg;

I don't like that \2 though because I hate counting relative positions. I can use the relative backreferences in Perl 5.10. The \g{-1} refers to the immediately preceding capture group:

 use 5.010;
 $str =~ s/
      \b
      (
         (\w+)
         (?:
          \s+
          \g{-1}
          \b
         )+
      )
     /[\1]/xg;

Counting isn't all that great either, so I can use labeled matches:

 use 5.010;
 $str =~ s/
      \b
      (
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[\1]/xg;

I can label the first capture ($1) and access its value in %+ later:

 use 5.010;
 $str =~ s/
      \b
      (?<dups>
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
      )
     /[$+{dups}]/xg;

I shouldn't really need that first capture though since it's really just there to refer to everything that matched. Sadly, it looks like ${^MATCH} isn't set early enough for me to use it in the replacement side. I think that's a bug. This should work but doesn't:

 $str =~ s/
      \b
         (?<word>\w+)
         (?:
          \s+
          \k<word>
          \b
         )+
     /[${^MATCH}]/pgx;   # DOESN'T WORK

I'm checking this on blead, but that's going to take a little while to compile on my tiny machine.

like image 90
3 revs Avatar answered Nov 15 '22 07:11

3 revs