Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I count overlapping substrings in Perl?

Tags:

string

count

perl

i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows

sub countnmstr
{
  $count =0;
  $count++ while $_[0] =~ /$_[1]/g;
  return $count;
}

$count = countnmstr("aaa","aa");

print "$count\n";

now this is what i would normally do. however, in the implementation above i want to count occurrence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.

can anyone suggest how to implement such a function??

like image 805
sfactor Avatar asked Jan 22 '10 00:01

sfactor


3 Answers

Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.

The Perl idiom for counting matches is then:

 my $count = () = $_[0] =~ /($pattern)/g;

The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.

Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:

 my $count = () = 'aaa' =~ /((?<=a)a)/g;

Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.

like image 89
brian d foy Avatar answered Nov 05 '22 06:11

brian d foy


See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.

You can use positive lookahead as suggested by others, and write the function as:

sub countnmstr {
    my ($haystack, $needle) = @_;
    my ($first, $rest) = $needle =~ /^(.)(.*)$/;
    return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}

You can also use pos to adjust where the next search picks up from:

#!/usr/bin/perl

use strict; use warnings;

sub countnmstr {
    my ($haystack, $needle) = @_;
    my $adj = length($needle) - 1;
    die "Search string cannot be empty!" if $adj < 0;

    my $count = 0;
    while ( $haystack =~ /\Q$needle/g ) {
        pos $haystack -= $adj;
        $count += 1;
    }
    return $count;
}

print countnmstr("aaa","aa"), "\n";

Output:

C:\Temp> t
2
like image 8
Sinan Ünür Avatar answered Nov 05 '22 05:11

Sinan Ünür


sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}

$count = countnmstr("aaa","aa");

print "$count\n";

A few points:

//g in list context matches as many times as possible.

\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.

Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.

This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.

$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().

Update: s///g is faster, though not as fast as using index:

sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( $string =~ s/(?=\Q$substr\E)//g );
}
like image 8
ysth Avatar answered Nov 05 '22 05:11

ysth