Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I quickly count the maximum number of consecutive single characters in a string?

I have a string similar to: but much longer

my $a = "000000001111111111000000011111111111111111111111111111111";

I am counting the number of "1"'s with:

my $total_1_available = $a =~ tr/1//;

And that works amazingly well and is really FAST.

HOWEVER, I also wish to count (in a fast way), the total number of consecutive 1's in a row. The MAX COUNT of "1's" consecutively.

IN the example above, it would return the count of:

11111111111111111111111111111111

As this is the maximum in a row.

So, I end up with the TOTAL_COUNT and also the TOTAL_CONSECUTIVE_COUNT.

I have it working with a REGEXP which basically replaces the 1's and then counts what was replaced and loops around...which actually is totally fine and works... but it doesn't "feel" right.

Ideally I don't want to replace the string at all as I am looking for the max consecutive count.

But, I know in Perl this probably isn't the fastest or cleanest way.

Could you teach me a better way please and increase my learning?

AS requested this is my current code:

 my $a= "0110011001101111";
 my $total_1_available = $a =~ tr/1//;
 print "Total number of 1's = $total_1_available\n";

 my $max_c = 0;
 while ( $a=~s/(1+)/ / ) {
   $max_c = length($1) if length($1) > $max_c;
 }
 print "Consecutive count   = $max_c\n";

And FINAL CODE:

use strict;
use warnings;
use Benchmark ':all';
use String::Random;

## We test 525,600 as this is the length of the string.
## Actually each 0 or 1 represents a minute of the year.
## And these represent engineer minues available in a 24 hr / 365 day year.
## And there are lots and lots of engineers.
## Hence my wish to improve the performance and I wish to thank everyone whom responded.

## there are a lot more 0's than 1's so hack to sort of simulate
my $test_regex = '[0][0][0][0][0][0-1][0-1][0-1][0-1][0-1]' x 52560;
my $pass       = String::Random->new;
my $string     = $pass->randregex($test_regex);

cmpthese(-1, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length($1); if ($max < $len) { $max = $len } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = $1; } length $match }
});

#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
#                BLOWN AWAY !!!!!!
like image 265
Mark Arnold Avatar asked Dec 17 '22 11:12

Mark Arnold


2 Answers

Remarkable speed improvements can be made with a dynamic regex. We can use a variable to store the max length string, then search for a string that is that long, plus one or more. The theory being that we only need to look for strings longer than the one we already have.

I used a solution that looks like this

sub hack {
    my $match = "";                        # original search string
    while ($string =~ /(${match}1+)/g) {   # search for $match plus 1 or more 1s
        $match = $1;                       # when found, change to new match
    }
    length $match;                         # return max length
}

And compared it to the original method described by the OP, with the following result

use strict;
use warnings;
use Benchmark ':all';

my $string = '0100100101111011010010101101101110101011111111101010100100100001011101010100' x 10_000;

cmpthese(-1, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length($1); if ($max < $len) { $max = $len } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = $1; } length $match }
});

Output:

       Rate    org   hack
org  7.31/s     --   -99%
hack 1372/s 18669%     --

Which seems astonishingly high, 19000% faster. It makes me think I've made a mistake, but I can't think what that would be. Maybe I am missing something in the regex machine internals, but this would be quite the improvement on the original solution.

like image 171
TLP Avatar answered Feb 05 '23 17:02

TLP


For short strings, the following is faster than all previously-presented solutions:

use List::Util qw( max );

max 0, map length, split /[^1]+/, $s
          Rate  hack  sort   org   max  mxsp    xs
hack   76879/s    --  -12%  -34%  -37%  -48%  -98%   <-- TLP
sort   87664/s   14%    --  -24%  -28%  -41%  -98%   <-- Jim Davis
org   115660/s   50%   32%    --   -6%  -22%  -98%   <-- OP
max   122504/s   59%   40%    6%    --  -17%  -98%   <-- Jim Davis
mxsp  147867/s   92%   69%   28%   21%    --  -97%   <-- ikegami (above)
xs   4950278/s 6339% 5547% 4180% 3941% 3248%    --   <-- ikegami (below)

Benchmark code:

use Benchmark qw( cmpthese );

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' );

cmpthese(-3, {
    org  => sub { my $max = 0; while ($string =~ /(1+)/g) { my $len = length($1); if ($max < $len) { $max = $len; } } },
    hack => sub { my $match = ""; while ($string =~ /(${match}1+)/g) { $match = $1; } my $max = length($match); },
    sort => sub { my $max = ( sort { $b <=> $a } $string =~ /(1+)/g )[0]; },
    max  => sub { my $max = max 0, map length, $string =~ /(1+)/g; },
    mxsp => sub { my $max = max 0, map length, split /[^1]+/, $string; },
    xs   => sub { my $max = longuest_ones_count($string); },
});

That said, the fastest solution would involve XS. The following is my stab at it:

IV longuest_ones_count(SV* sv) {
   IV max = 0;
   IV count = 0;

   // This code works whether the string is upgraded or downgraded.
   STRLEN len;
   char *s = SvPV(sv, len);
   while (len--) {
      if (*(s++) == '1') {
         ++count;
      }
      else if (count) {
         if (max < count)
            max = count;

         count = 0;
      }
   }

   if (max < count)
      max = count;

   return max;
}

One way of using it:

use 5.014;
use warnings;

use Inline C => <<'__';

...above code here...

__


say "$_: ", longuest_ones_count($_)
   for qw(
      0
      11111
      011111
      111110
      01110111110
      01111101110
   );

You saw this beat the other solutions out of the water for short strings. But you don't have short strings. For long strings, this is not as fast as TLP's version!!!

Same benchmark as above but using

my $string = ( '01001001011110110100101011011011101010'
             . '11111111101010100100100001011101010100' ) x 10_000;
       Rate   sort    org    max   mxsp     xs   hack
sort 8.61/s     --   -25%   -31%   -44%   -99%   -99%
org  11.6/s    34%     --    -8%   -24%   -99%   -99%
max  12.5/s    46%     9%     --   -18%   -99%   -99%
mxsp 15.3/s    77%    32%    22%     --   -99%   -99%  <-- ikegami (Perl)
xs   1031/s 11870%  8822%  8118%  6653%     --   -25%  <-- ikegami (XS)
hack 1366/s 15772% 11731% 10797%  8855%    33%     --  <-- TLP

Wow the regex engine is good! It can obviously be beat using XS (by eliminating the time needed to compile the pattern), but what's the point?

like image 41
ikegami Avatar answered Feb 05 '23 17:02

ikegami