Why is using a POSIX character class in my regex pattern giving unexpected results?

I have encountered some strange Perl behavior: using a POSIX character class in a regexp completely alters the sort order of the resulting strings.

Here is my test program:

sub namecmp($a,$b) {
  $a=~/([:alpha:]*)/;
  # $a=~/([a-z]*)/;
  $aword= $1;

  $b=~/([:alpha:]*)/;
  # $b=~/([a-z]*)/;
  $bword= $1;
  return $aword cmp $bword;
};

$_= <>;
@names= sort namecmp split;
print join(" ", @names), "\n";

If you switch to the commented-out regexps using [a-z], you get the normal, lexicographic sort order. However, the POSIX [:alpha:] character class yields some weird-ass sort order, as follows:

$test_normal
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb

$test_posix
aaa aab aac aba abb abc aca acb acc baa bab bac bba bbb bbc bca bcb bcc caa cbb
baa bab bac bba bbb bbc bca bcb bcc caa cbb aba abb abc aca acb acc aab aac aaa

My best guess is that the POSIX character class is activating some kind of locale stuff I've never heard of and didn't ask for. I suppose the logical reaction to "doctor, doctor, it hurts when I do this!" is, "well, don't do that, then!".

But can anyone tell me what's happening here, and why? I'm using perl 5.10, but I believe the same thing happens under perl 5.8.

asked Feb 25 '10 09:02

comingstorm

3 Answers

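[:alpha:] is the POSIX class for alphabetic characters in Perl regular expressions, but its square brackets are part of the class name; they do not open and close a bracketed character class the way square brackets normally do. The class therefore has to sit inside an outer pair of brackets, so you need: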

$a=~/([[:alpha:]]*)/;

This is mentioned in perlre:

The POSIX character class syntax
[:class:]
is also available. Note that the [ and ] brackets are literal; they must always be used within a character class expression.

# this is correct:
$string =~ /[[:alpha:]]/;

# this is not, and will generate a warning:
$string =~ /[:alpha:]/;

answered Oct 27 '22 09:10

Greg Hewgill

What you are writing is not Perl by any stretch of the imagination. You are able to get away with it because you have turned off warnings. If you had used warnings, perl would have told you

POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/([:alpha:] <-- HERE *)/ at j.pl line 4.

POSIX syntax [: :] belongs inside character classes in regex; marked by <-- HERE in m/([:alpha:] <-- HERE *)/ at j.pl line 8.

Imagine that!

Now, perl would have also told you:

Illegal character in prototype for main::namecmp : $a,$b at j.pl line 3.

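because Perl is not C. Perl does not have function prototypes of the sort you seem to be trying to use: a sort comparator receives its two operands in the package variables $a and $b, not as named parameters.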

A better way of writing the exact same functionality, in Perl this time, is:

use warnings; use strict;

sub namecmp {
    my ($aword) = $a =~ /([[:alpha:]]*)/;
    my ($bword) = $b =~ /([[:alpha:]]*)/;
    return $aword cmp $bword;
}

print join(' ', sort namecmp split ' ', scalar <>), "\n";
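To try the comparator without feeding anything on standard input, here is a minimal sketch using a hard-coded list. The sample names are hypothetical and chosen so that no two share the same alphabetic prefix; the trailing digits only show that the comparison looks at the leading letters alone.

```perl
use strict;
use warnings;

# Compare two names by their leading run of alphabetic characters.
# sort sets the package variables $a and $b before each call, so the
# comparator takes no arguments.
sub namecmp {
    my ($aword) = $a =~ /([[:alpha:]]*)/;
    my ($bword) = $b =~ /([[:alpha:]]*)/;
    return $aword cmp $bword;
}

# Hypothetical sample data; the digits are ignored by the comparison.
my @names  = qw(carol7 alice10 bob2);
my @sorted = sort namecmp @names;

print join(' ', @sorted), "\n";   # prints "alice10 bob2 carol7"
```

Names whose prefixes compare equal would be treated as ties, and this sketch makes no assumption about the order in which ties come out.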

answered Oct 27 '22 10:10

Sinan Ünür

Because Perl doesn't support POSIX character classes in this form. (Use [[:alpha:]]. See @Greg's answer)

[:alpha:]

is interpreted as a character class consisting of the characters "a", "h", "l", "p" and ":".

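Now, for strings that do not contain any of [ahlp:] at the beginning, e.g. "baa", the match still succeeds (because of the *) but captures an empty string. An empty string is of course smaller than any other string, so those entries are arranged at the beginning. The remaining strings are compared only by their leading run of those characters: "aba" yields "a", "aab" yields "aa" and "aaa" yields "aaa", which is why they come out in that order.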
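The effect is easy to check directly. The following is a rough sketch, not part of the original program: broken_key and posix_key are hypothetical helper names, and [ahlp:] is written out explicitly as the class that the bare [:alpha:] is treated as, so the snippet runs without the warning.

```perl
use strict;
use warnings;

# Hypothetical helper: capture the leading run of a, h, l, p or ':'.
# [ahlp:] spells out what the bare [:alpha:] is interpreted as.
sub broken_key {
    my ($word) = @_;
    my ($key) = $word =~ /([ahlp:]*)/;
    return $key;
}

# The intended key: the leading run of alphabetic characters.
sub posix_key {
    my ($word) = @_;
    my ($key) = $word =~ /([[:alpha:]]*)/;
    return $key;
}

# Print both keys for a few of the words from the question.
for my $word (qw(aaa aab aba baa cbb)) {
    printf "%-4s broken='%s' posix='%s'\n",
        $word, broken_key($word), posix_key($word);
}
```

The broken keys come out as 'aaa', 'aa', 'a', '' and '' respectively, while the corrected pattern captures each whole word. Sorting on the broken keys puts the empty ones first, then 'a', 'aa' and 'aaa', which matches the order shown in the question.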

answered Oct 27 '22 11:10

kennytm

Tags: regex, sorting, perl