
Why is using a POSIX character class in my regex pattern giving unexpected results?

I have encountered some strange Perl behavior: using a POSIX character class in a regexp completely alters the sort order of the resulting strings.

Here is my test program:

sub namecmp($a,$b) {
  $a=~/([:alpha:]*)/;
  # $a=~/([a-z]*)/;
  $aword= $1;

  $b=~/([:alpha:]*)/;
  # $b=~/([a-z]*)/;
  $bword= $1;
  return $aword cmp $bword;
};

$_= <>;
@names= sort namecmp split;
print join(" ", @names), "\n";

If you switch to the commented-out regexps using [a-z], you get the normal, lexicographic sort order. However, the POSIX [:alpha:] character class yields some weird-ass sort order, as follows:

$test_normal
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb

$test_posix
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
baa bab bac bba bbb bbc bca bcb bcc caa cbb aba abb abc aca acb acc aab aac aaa

My best guess is that the POSIX character class is activating some kind of locale stuff I've never heard of and didn't ask for. I suppose the logical reaction to "doctor, doctor, it hurts when I do this!" is, "well, don't do that, then!".

But can anyone tell me what's happening here, and why? I'm using perl 5.10, but I believe the same thing happens under perl 5.8.

comingstorm asked Feb 25 '10 09:02



3 Answers

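The POSIX class [:alpha:] matches alphabetic characters in Perl regular expressions, but its square brackets are part of the class name; they do not open a character class the way brackets normally do. The whole thing has to sit inside an enclosing bracketed character class, so you need: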

$a=~/([[:alpha:]]*)/;

This is mentioned in perlre:

The POSIX character class syntax

[:class:]

is also available. Note that the [ and ] brackets are literal; they must always be used within a character class expression.

# this is correct:
$string =~ /[[:alpha:]]/;

# this is not, and will generate a warning:
$string =~ /[:alpha:]/;
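
Because the POSIX class is just one element of the enclosing bracketed class, it can be combined with other elements or negated like any other member. The following is a minimal sketch of that idea; the classify helper and its category names are made up for illustration and are not part of perlre.

```perl
use strict;
use warnings;

# classify() is a hypothetical helper used only to illustrate the syntax.
sub classify {
    my ($token) = @_;

    # A POSIX class on its own inside the brackets.
    return 'letters'         if $token =~ /\A[[:alpha:]]+\z/;

    # POSIX classes mixed with a literal underscore in the same brackets.
    return 'identifier-like' if $token =~ /\A[[:alpha:]_][[:alnum:]_]*\z/;

    # The enclosing class negated: one or more characters that are not letters.
    return 'no letters'      if $token =~ /\A[^[:alpha:]]+\z/;

    return 'mixed';
}

print "$_: ", classify($_), "\n" for qw(abc x_1 42 a-1);
```

With warnings enabled, none of these patterns triggers the "POSIX syntax [: :] belongs inside character classes" message, since every [:name:] appears inside an outer pair of brackets.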

Greg Hewgill answered Oct 27 '22 09:10


What you are writing is not Perl by any stretch of the imagination. You are able to get away with it because you have not enabled warnings. If you had used warnings, perl would have told you:

POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/([:alpha:] <-- HERE *)/ at j.pl line 4.

POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/([:alpha:] <-- HERE *)/ at j.pl line 8.

Imagine that!

Now, perl would have also told you:

Illegal character in prototype for main::namecmp : $a,$b at j.pl line 3.

because Perl is not C. Perl does not have function prototypes of the sort you seem to be trying to use.

A better way of writing the exact same functionality, in Perl this time, is:

use warnings; use strict;

sub namecmp {
    my ($aword) = $a =~ /([[:alpha:]]*)/;
    my ($bword) = $b =~ /([[:alpha:]]*)/;
    return $aword cmp $bword;
}

print join(' ', sort namecmp split ' ', scalar <>), "\n";

Sinan Ünür answered Oct 27 '22 10:10


Because Perl doesn't support POSIX character classes in this form. (Use [[:alpha:]] instead; see @Greg's answer.)

So

[:alpha:]

is interpreted as a character class consisting of the characters "a", "h", "l", "p" and ":".

Now, for strings that do not begin with one of the characters in [ahlp:], e.g. "baa", the match still succeeds (because of the *) but captures an empty string. An empty string of course sorts before any other string, so those words are arranged at the beginning.
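
To see the effect on the sort keys, here is a small sketch that compares what each pattern captures for a few of the test words. The capture_prefix helper is hypothetical, and the mistaken pattern is spelled out as [ahlp:] (the set described above) so that the script runs cleanly under warnings.

```perl
use strict;
use warnings;

# capture_prefix() is a hypothetical helper: it returns what $1 would hold
# after matching $word against /($class*)/, as namecmp does.
sub capture_prefix {
    my ($class, $word) = @_;
    my ($prefix) = $word =~ /($class*)/;
    return $prefix;
}

my $mistaken = qr/[ahlp:]/;      # how Perl reads a bare [:alpha:]
my $intended = qr/[[:alpha:]]/;  # the POSIX class inside a bracketed class

for my $word (qw(aaa aab aba baa)) {
    printf "%-3s  mistaken='%s'  intended='%s'\n",
        $word,
        capture_prefix($mistaken, $word),
        capture_prefix($intended, $word);
}
```

The mistaken pattern captures "aaa", "aa", "a" and an empty string for these four words, while the intended pattern captures each word whole. Comparing the mistaken keys with cmp puts the empty keys first and "aaa" last, which is the ordering seen in the question's output.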

kennytm answered Oct 27 '22 11:10