Calculate Number of Consecutive Characters in a String using Perl

Question

I have a string with multiple sequences of consecutive characters like:

aaabbcccdddd

I want to represent this as: a3b2c3d4

As of now, I have come up with this:

#! /usr/bin/perl

$str = "aaabbcccdddd";
$str =~ s/(.)\1+/$1/g;

print $str."\n";

Output:

abcd

It stores the consecutive characters in the capture buffer and returns only one. However, I want a way to count the number of consecutive characters in the capture buffer and then display only one character followed by that count so that it displays the output as a3b2c3d4 instead of abcd.

What modification is required to the above regex?

Jonathan Leffler · Accepted Answer

This seems to require the /e ('evaluate') modifier on the substitution so that the replacement text is treated as a fragment of Perl code:

 $str =~ s/((.)\2+)/$2 . length($1)/ge;
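To see what the /e modifier changes, here is a minimal sketch that runs the same substitution with and without it. The sub names replace_plain and replace_eval are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: without /e the replacement is interpolated
# like a double-quoted string, so "length(" stays literal text.
sub replace_plain {
    my ($str) = @_;
    $str =~ s/((.)\2+)/$2 . length($1)/g;
    return $str;
}

# Hypothetical helper: with /e the replacement is evaluated as a
# Perl expression for every match.
sub replace_eval {
    my ($str) = @_;
    $str =~ s/((.)\2+)/$2 . length($1)/ge;
    return $str;
}

print replace_plain("aaabb"), "\n";   # a . length(aaa)b . length(bb)
print replace_eval("aaabb"), "\n";    # a3b2
```

Without /e the variables are still interpolated, but the concatenation and the call to length are never executed.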

Script

#!/usr/bin/env perl
use strict;
use warnings;

my $original = "aaabbcccdddd";
my $alternative = "aaabbcccddddeffghhhhhhhhhhhh";

sub proc1
{
    my($str) = @_;
    $str =~ s/(.)\1+/$1/g;
    print "$str\n";
}

proc1 $original;
proc1 $alternative;

sub proc2
{
    my($str) = @_;
    $str =~ s/((.)\2+)/$2 . length($1)/ge;
    print "$str\n";
}

proc2 $original;
proc2 $alternative;

Output

abcd
abcdefgh
a3b2c3d4
a3b2c3d4ef2gh12

Could you please break down the regular expression to explain how it works?

I'm assuming it is the match part that is problematic and not the replacement part.

The original regex is:

(.)\1+

This captures a single character (.) that is followed by the same character repeated one or more times.

The revised regex is 'the same', but also captures the whole pattern:

((.)\2+)

The first open parenthesis starts the overall capture; the second open parenthesis starts the capture of a single character. But, it is now the second capture, so the \1 in the original needs to become \2 in the revision.

Because the search captures the whole string of repeated characters, the replacement can determine the length of the pattern easily.
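As a rough illustration of the two captures, the sketch below walks over the matches one at a time and reports what $1 and $2 hold for each. The sub name describe_runs and the colon-separated output format are assumptions made for this example only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns one "run:char:length" entry for each
# run of two or more identical characters.
sub describe_runs {
    my ($str) = @_;
    my @runs;
    while ($str =~ /((.)\2+)/g) {
        # $1 is the whole run, $2 is the single repeated character
        push @runs, join(":", $1, $2, length($1));
    }
    return @runs;
}

print "$_\n" for describe_runs("aaabbcccdddde");
# aaa:a:3
# bb:b:2
# ccc:c:3
# dddd:d:4
```

The trailing e is not reported because \2+ needs at least one repetition, which is also why single characters pass through the substitution unchanged.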

Zaid · Answer

The following works if you can live with the slow-down caused by $&:

$str =~ s/(.)\1*/$1. length $&/ge;

Changing the * to + in the above expression leaves non-consecutive characters untouched.
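The difference between the two quantifiers can be sketched as follows. To stay independent of $&, this example uses the two-capture form of the pattern instead; the sub names encode_all and encode_repeats are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: with * every character gets a count,
# including runs of length 1.
sub encode_all {
    my ($str) = @_;
    $str =~ s/((.)\2*)/$2 . length($1)/ge;
    return $str;
}

# Hypothetical helper: with + only runs of two or more are rewritten.
sub encode_repeats {
    my ($str) = @_;
    $str =~ s/((.)\2+)/$2 . length($1)/ge;
    return $str;
}

print encode_all("aabc"), "\n";      # a2b1c1
print encode_repeats("aabc"), "\n";  # a2bc
```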

As JRFerguson points out, Perl 5.10+ provides an equivalent ${^MATCH} variable (enabled by the /p modifier) that does not affect regex performance:

$str =~ s/(.)\g{1}+/$1. length ${^MATCH}/pge;

For Perl 5.6+, the performance hit can still be avoided:

$str =~ s/(.)\1+/ $1 . ( $+[0] - $-[0] ) /ge;
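The expression $+[0] - $-[0] works because @- and @+ hold the start and end offsets of the last successful match, so their difference is the match length. A small sketch, assuming a hypothetical sub named run_offsets:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns "char,start,end,length" for each run
# of two or more identical characters.
sub run_offsets {
    my ($str) = @_;
    my @out;
    while ($str =~ /(.)\1+/g) {
        # $-[0] is the offset where the match starts,
        # $+[0] is the offset just past where it ends
        push @out, join(",", $1, $-[0], $+[0], $+[0] - $-[0]);
    }
    return @out;
}

print "$_\n" for run_offsets("aaabbcccdddd");
# a,0,3,3
# b,3,5,2
# c,5,8,3
# d,8,12,4
```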

José Pablo Orozco Marín · Answer

JS:

let data = "ababaaaabbbababb";

// Return the replacement from the callback instead of mutating data inside it
data = data.replace(/((.)\2+)/g, (match, p1, p2) => p2 + p1.length);

console.log(data);

Tags: regex, perl

Neon Flash
