
Calculate Number of Consecutive Characters in a String using Perl

Tags: regex, perl

I have a string with multiple sequences of consecutive characters like:

aaabbcccdddd

I want to represent this as: a3b2c3d4

As of now, I have come up with this:

#! /usr/bin/perl

$str = "aaabbcccdddd";
$str =~ s/(.)\1+/$1/g;

print $str."\n";

Output:

abcd

It stores the consecutive characters in the capture buffer and returns only one. However, I want a way to count the number of consecutive characters in the capture buffer and then display only one character followed by that count so that it displays the output as a3b2c3d4 instead of abcd.

What modification is required to the above regex?

Asked by Neon Flash, Jun 10 '12 13:06


3 Answers

This requires the /e ("evaluate") modifier on the substitution operator, so that the replacement is evaluated as a Perl expression instead of being used as a plain string:

 $str =~ s/((.)\2+)/$2 . length($1)/ge;

Script

#!/usr/bin/env perl
use strict;
use warnings;

my $original = "aaabbcccdddd";
my $alternative = "aaabbcccddddeffghhhhhhhhhhhh";

sub proc1
{
    my($str) = @_;
    $str =~ s/(.)\1+/$1/g;
    print "$str\n";
}

proc1 $original;
proc1 $alternative;

sub proc2
{
    my($str) = @_;
    $str =~ s/((.)\2+)/$2 . length($1)/ge;
    print "$str\n";
}

proc2 $original;
proc2 $alternative;

Output

abcd
abcdefgh
a3b2c3d4
a3b2c3d4ef2gh12

Could you please break down the regular expression to explain how it works?

I'm assuming it is the match part that is problematic and not the replacement part.

The original regex is:

(.)\1+

This captures a single character (.) that is followed by the same character repeated one or more times.

The revised regex is 'the same', but also captures the whole pattern:

((.)\2+)

The first open parenthesis starts the overall capture; the second starts the capture of a single character. Because the single character is now the second capture group, the \1 in the original becomes \2 in the revision.

Because the match captures the whole run of repeated characters in $1, the replacement can get the length of the run with length($1).
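To see what each capture group holds, the same pattern can be run in a loop without substituting anything. This is only a minimal sketch; the helper name runs is made up for illustration and is not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: return each run of repeated characters as [char, length].
sub runs
{
    my($str) = @_;
    my @found;
    # $1 is the whole run, $2 is the single character that repeats.
    while ($str =~ /((.)\2+)/g)
    {
        push @found, [$2, length($1)];
    }
    return @found;
}

# The trailing "e" is not listed because it is not repeated.
printf "%s repeats %d times\n", @$_ for runs("aaabbcccdddde");
```

Each element pairs the repeated character with the length of its run, which is exactly what the replacement expression concatenates.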

Answered by Jonathan Leffler, Oct 28 '22 05:10


The following works if you can live with the slow-down caused by $&:

$str =~ s/(.)\1*/$1. length $&/ge;

Changing the * to + in the above expression leaves characters that are not repeated untouched, so no count of 1 is appended to them.

As JRFerguson points out, Perl 5.10+ provides the ${^MATCH} variable, which holds the same text as $& when the /p modifier is used and does not affect regex performance:
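The difference between the two quantifiers can be sketched side by side. To keep the example free of $&, it uses nested captures instead; that swap and the sub names encode_all and encode_repeats are assumptions made here for illustration.

```perl
use strict;
use warnings;

# With *, a run of length 1 still matches, so every character gets a count.
sub encode_all
{
    my($str) = @_;
    $str =~ s/((.)\2*)/$2 . length($1)/ge;
    return $str;
}

# With +, at least one repeat is required, so single characters are skipped.
sub encode_repeats
{
    my($str) = @_;
    $str =~ s/((.)\2+)/$2 . length($1)/ge;
    return $str;
}

print encode_all("aaabccd"), "\n";      # a3b1c2d1
print encode_repeats("aaabccd"), "\n";  # a3bc2d
```

The * form gives an unambiguous encoding of every character; the + form matches the a3b2c3d4 style asked for in the question while leaving lone characters as they are.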

$str =~ s/(.)\g{1}+/$1. length ${^MATCH}/pge;

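For Perl 5.6+, the performance hit can still be avoided with the @- and @+ offset arrays. Note the \1 backreference here, since \g{1} is only available from Perl 5.10:

$str =~ s/(.)\1+/ $1 . ( $+[0] - $-[0] ) /ge;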
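$-[0] and $+[0] hold the start offset of the last successful match and the offset just past its end, so their difference is the match length. A small sketch that prints those offsets for each run; the sub name spans is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: describe where each run of repeated characters starts and ends.
sub spans
{
    my($str) = @_;
    my @out;
    while ($str =~ /(.)\1+/g)
    {
        # $-[0] is where the whole match begins, $+[0] is one past where it ends.
        push @out, sprintf("%s: start=%d end=%d length=%d",
                           $1, $-[0], $+[0], $+[0] - $-[0]);
    }
    return @out;
}

print "$_\n" for spans("xxaaaay");
# x: start=0 end=2 length=2
# a: start=2 end=6 length=4
```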
Answered by Zaid, Oct 28 '22 05:10


JS:

const data = "ababaaaabbbababb";

// The callback's return value replaces each run: the character plus the run length.
const result = data.replace(/((.)\2+)/g, (match, p1, p2) => p2 + p1.length);

console.log(result); // ababa4b3abab2
Answered by José Pablo Orozco Marín, Oct 28 '22 05:10