I have encountered some strange Perl behavior: using a Posix character class in a regexp completely alters the sort order for the resulting strings.
Here is my test program:
sub namecmp($a,$b) {
$a=~/([:alpha:]*)/;
# $a=~/([a-z]*)/;
$aword= $1;
$b=~/([:alpha:]*)/;
# $b=~/([a-z]*)/;
$bword= $1;
return $aword cmp $bword;
};
$_= <>;
@names= sort namecmp split;
print join(" ", @names), "\n";
If you switch to the commented-out regexps using [a-z], you get the normal, lexicographic sort order. However, the POSIX [:alpha:] character class yields some weird-ass sort order, as follows:
$test_normal
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
$test_posix
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
baa bab bac bba bbb bbc bca bcb bcc caa cbb aba abb abc aca acb acc aab aac aaa
My best guess is that the Posix character class is activating some kind of locale stuff I've never heard of and didn't ask for. I suppose the logical reaction to "doctor, doctor, it hurts when I do this!" is, "well, don't do that, then!".
But, can anyone tell me what's happening here, and why? I'm using perl 5.10, but I believe it also works under perl 5.8.
[:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]]. The POSIX character class names must be written all lowercase. When used on ASCII strings, [x-z[:digit:]] and [x-z0-9] find exactly the same matches: a single character that is either x, y, z, or a digit.
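As a rough check of that equivalence, the sketch below compares [x-z[:digit:]] with the plain range [x-z0-9] over every ASCII character. The range form is assumed here to be the second expression the paragraph has in mind, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Two spellings of "x, y, z, or a digit".
my $posix_form = qr/\A[x-z[:digit:]]\z/;
my $range_form = qr/\A[x-z0-9]\z/;

# Collect every ASCII code point on which the two patterns disagree.
my @mismatches;
for my $code (0 .. 127) {
    my $char     = chr $code;
    my $in_posix = $char =~ $posix_form ? 1 : 0;
    my $in_range = $char =~ $range_form ? 1 : 0;
    push @mismatches, $code if $in_posix != $in_range;
}

print @mismatches
    ? "patterns differ at code points: @mismatches\n"
    : "no differences on ASCII input\n";
```

Outside ASCII the two forms can diverge, because [:digit:] may also match other Unicode digits while 0-9 never does.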
The POSIX Basic Regular Expression (BRE) syntax provided extensions to achieve consistency between utility programs such as grep, sed and awk. These extensions are not supported by some traditional implementations of Unix tools.
The character class [:alpha:] represents alpha characters in Perl regular expressions, but the square brackets do not mean what they normally do in regular expressions. So you need:
$a=~/([[:alpha:]]*)/;
This is mentioned in perlre:
The POSIX character class syntax [:class:] is also available. Note that the [ and ] brackets are literal; they must always be used within a character class expression.
# this is correct:
$string =~ /[[:alpha:]]/;
# this is not, and will generate a warning:
$string =~ /[:alpha:]/;
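To see the difference in practice, here is a minimal sketch that runs both forms against three of the words from the question. The names $bare and $posix are made up for this illustration, and the regexp warning is switched off on purpose so that the mistaken form can be compiled quietly.

```perl
use strict;
use warnings;

my ($bare, $posix);
{
    # Silence the "POSIX syntax [: :] belongs inside character classes"
    # warning so the mistaken form can be shown deliberately.
    no warnings 'regexp';
    $bare = qr/^([:alpha:]*)/;     # really the set ':', 'a', 'l', 'p', 'h'
}
$posix = qr/^([[:alpha:]]*)/;      # the actual POSIX alphabetic class

for my $word (qw(baa aba aaa)) {
    my ($got_bare)  = $word =~ $bare;
    my ($got_posix) = $word =~ $posix;
    print "$word: bare='$got_bare' posix='$got_posix'\n";
}
```

With the bare form, "baa" captures nothing because "b" is not in the set, while the doubled brackets capture the whole word.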
What you are writing is not Perl by any stretch of the imagination. You are able to get away with it only because you are running without warnings. If you had enabled warnings, perl would have told you:
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE
in m/([:alpha:] <-- HERE *)/ at j.pl line 4.
POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE
in m/([:alpha:] <-- HERE *)/ at j.pl line 8.
Imagine that!
Now, perl would have also told you:
Illegal character in prototype for main::namecmp : $a,$b at j.pl line 3.
because Perl is not C. Perl does not have function prototypes of the sort you seem to be trying to use.
A better way of writing the exact same functionality, in Perl this time, is:
use warnings; use strict;
sub namecmp {
my ($aword) = $a =~ /([[:alpha:]]*)/;
my ($bword) = $b =~ /([[:alpha:]]*)/;
return $aword cmp $bword;
}
print join(' ', sort namecmp split ' ', scalar <>), "\n";
Because Perl doesn't support POSIX character classes in this form. (Use [[:alpha:]]. See @Greg's answer.)
So [:alpha:] is interpreted as a character class consisting of the characters "a", "h", "l", "p" and ":".
Now, for strings that do not contain [ahlp:] at the beginning, e.g. "baa", the match will still succeed (because of the *) but return an empty string. An empty string is of course smaller than any other string, so those strings will be arranged at the beginning.