I have two strings, and I would like to test if they are anagrams of each other.
To test if string A is an anagram of string B, the characters of A and B are both sorted. If the resulting sorted strings match exactly, string A and string B are anagrams of each other.
I am split
ing the strings up into character arrays, using Perl's sort
routine, join
ing the characters back together, and testing for string equality with eq
:
sub anagram
{
my ($s1, $s2) = @_;
return (join '', sort { $a cmp $b } split(//, $s1)) eq
(join '', sort { $a cmp $b } split(//, $s2));
}
Is there a way to avoid having to convert between the scalar and array types (relying on join
and split
)? And if so, which method is more efficient?
Well, I've found a way that's over 30 times faster—though, arguably, its cheating. I've included the Benchmark.pm code to benchmark it, since you're apparently not familiar with it.
The benchmark is:
Rate Join Cheat
Join 83294/s -- -97%
Cheat 2580687/s 2998% --
And the code. After the third line, I think you'll understand why its arguably cheating:
use v5.14;
use Benchmark qw(cmpthese);
use Inline 'C';
sub an_join {
my ($s1, $s2) = @_;
return (join '', sort split(//, $s1)) eq
(join '', sort split(//, $s2));
}
use constant {
STR1 => 'abcdefghijklm',
STR2 => 'abcdefghijkmm',
STR3 => 'abcdefghijkml',
};
cmpthese(
0,
{
'Join' => 'an_join(STR1, STR2); an_join(STR1, STR3)',
'Cheat' => 'an_cheat(STR1, STR2); an_cheat(STR1, STR3)',
});
__END__
__C__
int an_cheat(const char *a, const char *b) {
unsigned char vec_a[26], vec_b[26];
const char *p, *end;
memset(vec_a, 0, sizeof(vec_a));
memset(vec_b, 0, sizeof(vec_b));
end = a+strlen(a);
for (p = a; p < end; ++p)
if (*p >= 'a' && *p <= 'z')
++vec_a[(*p)-'a'];
end = b+strlen(b);
for (p = b; p < end; ++p)
if (*p >= 'a' && *p <= 'z')
++vec_b[(*p)-'a'];
return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}
Of course, its cheating because its not written in Perl—its in C. Also, it has limitations the Perl version doesn't (only works with lowercase ASCII characters being the most significant—it just ignores everything else). But if you really need speed, you can use cheating like this.
edit:
Extending to all of Latin1 (well, raw 8-bit characters, really). Also, I found that the compiler managed to optimize a simpler loop (without point arithmetic) better, and its easier to read too, so... Benchmark tells me that the lowercase-ASCII-only version is about 10% faster:
int an_cheat_l1b(const char *a, const char *b) {
unsigned char vec_a[UCHAR_MAX], vec_b[UCHAR_MAX];
size_t len, i;
memset(vec_a, 0, sizeof(vec_a));
memset(vec_b, 0, sizeof(vec_b));
len = strlen(a);
for (i = 0; i < len; ++i)
++vec_a[((const unsigned char *)(a))[i]];
len = strlen(b);
for (i = 0; i < len; ++i)
++vec_b[((const unsigned char *)(b))[i]];
return 0 == memcmp(vec_a, vec_b, sizeof(vec_a));
}
Note that the advantage of the C version grows as the string gets longer—which is expected, since its Θ(n) as opposed to the Perl versions O(n·logn). Also the penalty for full Latin1 decreases, meaning that penalty is probably the memcmp.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With