Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl: Sort characters within a string

I have two strings, and I would like to test if they are anagrams of each other.

To test if string A is an anagram of string B, the characters of A and B are both sorted. If the resulting sorted strings match exactly, string A and string B are anagrams of each other.

I am spliting the strings up into character arrays, using Perl's sort routine, joining the characters back together, and testing for string equality with eq:

sub anagram
{
  my ($s1, $s2) = @_;

  return (join '', sort { $a cmp $b } split(//, $s1)) eq
         (join '', sort { $a cmp $b } split(//, $s2));
}

Is there a way to avoid having to convert between the scalar and array types (relying on join and split)? And if so, which method is more efficient?

like image 430
ardnew Avatar asked Jan 12 '12 19:01

ardnew


1 Answers

Well, I've found a way that's over 30 times faster—though, arguably, its cheating. I've included the Benchmark.pm code to benchmark it, since you're apparently not familiar with it.

The benchmark is:

           Rate  Join Cheat
Join    83294/s    --  -97%
Cheat 2580687/s 2998%    --

And the code. After the third line, I think you'll understand why its arguably cheating:

use v5.14;
use Benchmark qw(cmpthese);
use Inline 'C';

sub an_join {
    my ($s1, $s2) = @_;
    return (join '', sort split(//, $s1)) eq
        (join '', sort split(//, $s2));
}

use constant {
    STR1 => 'abcdefghijklm',
    STR2 => 'abcdefghijkmm',
    STR3 => 'abcdefghijkml',
};

cmpthese(
    0,
    {
        'Join'  => 'an_join(STR1, STR2);  an_join(STR1, STR3)',
        'Cheat' => 'an_cheat(STR1, STR2); an_cheat(STR1, STR3)',
    });

__END__
__C__

int an_cheat(const char *a, const char *b) {
    unsigned char vec_a[26], vec_b[26];
    const char *p, *end;

    memset(vec_a, 0, sizeof(vec_a));
    memset(vec_b, 0, sizeof(vec_b));

    end = a+strlen(a);
    for (p = a; p < end; ++p)
        if (*p >= 'a' && *p <= 'z')
            ++vec_a[(*p)-'a'];
    end = b+strlen(b);
    for (p = b; p < end; ++p)
        if (*p >= 'a' && *p <= 'z')
            ++vec_b[(*p)-'a'];

    return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}

Of course, its cheating because its not written in Perl—its in C. Also, it has limitations the Perl version doesn't (only works with lowercase ASCII characters being the most significant—it just ignores everything else). But if you really need speed, you can use cheating like this.

edit:

Extending to all of Latin1 (well, raw 8-bit characters, really). Also, I found that the compiler managed to optimize a simpler loop (without point arithmetic) better, and its easier to read too, so... Benchmark tells me that the lowercase-ASCII-only version is about 10% faster:

int an_cheat_l1b(const char *a, const char *b) {
    unsigned char vec_a[UCHAR_MAX], vec_b[UCHAR_MAX];
    size_t len, i;

    memset(vec_a, 0, sizeof(vec_a));
    memset(vec_b, 0, sizeof(vec_b));

    len = strlen(a);
    for (i = 0; i < len; ++i)
        ++vec_a[((const unsigned char *)(a))[i]];
    len = strlen(b);
    for (i = 0; i < len; ++i)
        ++vec_b[((const unsigned char *)(b))[i]];

    return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}

Note that the advantage of the C version grows as the string gets longer—which is expected, since its Θ(n) as opposed to the Perl versions O(n·logn). Also the penalty for full Latin1 decreases, meaning that penalty is probably the memcmp.

like image 51
derobert Avatar answered Sep 24 '22 17:09

derobert