Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is Perl's "standard string comparison order"?

This is really a double question, my two end goals having answers to:

  • What is the standard string comparison order, in terms of the mechanics?
  • What's a better name for that so I can update the docs?

Perl's documentation for sort says that without a block, sort uses "standard string comparison order". But what is that order? There should be a better name for it. For this question, I specifically mean the situation where locale is not in effect, since that defines its own order.

In years past, we normally called the standard sort order "ASCIIbetically". It's in Learning Perl and many other books. However, that term is dated. Perl has been Unicode-aware since 5.6. Talking about ASCII is old school. Since Perl is also Unicode-aware, it knows about character strings. In sv.c, Perl_sv_cmp knows about locale, bytes, and UTF-8. The first two are easy. But I'm not confident about the third.

/*
=for apidoc sv_cmp

Compares the strings in two SVs.  Returns -1, 0, or 1 indicating whether the
string in C<sv1> is less than, equal to, or greater than the string in
C<sv2>. Is UTF-8 and 'use bytes' aware, handles get magic, and will
coerce its args to strings if necessary.  See also C<sv_cmp_locale>.

=cut
*/

When Perl sorts using UTF-8, what is it really sorting? The bytes the string encodes to, the characters it represents (including marks maybe?), or something else? I think this is the relevant line in sv.c (line 6698 for commit 7844ec1):

 pv1 = tpv = (char*)bytes_to_utf8((const U8*)pv1, &cur1);

If I'm reading that right (using my rusty C), pv1 is coerced to octets, turned into UTF-8, then coerced to characters (in the C sense). I think that means it's sorting by the UTF-8 encoding (i.e. the actual bytes that UTF-8 uses to represent a code point). Another way to say that is that it doesn't sort on graphemes. I think I've almost convinced myself I'm reading this right, but some of you know way more about this than I do.

From that, the next interesting line is 6708:

 const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

To me that looks like once it has pv1 and pv2, which were coerced to char *, now are just compared byte-by-byte because they are coerced to void *. Is that what happens with memcmp, which looks like it's just comparing bits based on the various docs I've read so far? Again, I'm wondering what I'm missing in the trip from bytes->utf8->char->bytes, like maybe a Unicode normalization step. Checking out Perl_bytes_to_utf8 in utf8.c didn't help me answer that question.

As a side note, I'm wondering if this is the same thing as the Unicode Collation Algorithm? If it is, why does Unicode::Collate exist? From the looks of it, I don't think Perl's sort handles canonical equivalence.

like image 603
brian d foy Avatar asked Nov 04 '09 22:11

brian d foy


2 Answers

UTF-8 has the property that sorting a UTF-8 string byte-by-byte according to the byte value gives the same ordering as sorting it codepoint-by-codepoint according to the codepoint number. That is, I know without looking that the UTF-8 representation of U+2345 is lexicographically after the UTF-8 representation of U+1234.

As for normalization, the Perl core doesn't know anything about it; to get accurate sorting and comparison among the different forms you would want to run all of your strings through Unicode::Normalize and convert them all to the same normalization form. I can't comment on which is best for any given purpose, mostly because I have no clue.

Also, sorting and cmp are affected by the locale pragma if it's in use; it uses the POSIX collation order. Using use locale, an 8-bit locale, and unicode all together is a recipe for disaster, but using use locale, a UTF-8 locale, and unicode should work usefully. I can't say I've tried it. There's an awful lot of info in perllocale and perlunicode anyway.

like image 86
hobbs Avatar answered Nov 18 '22 13:11

hobbs


I can't answer the whole question, so let me hone in on one part:

    const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

... looks like once it has pv1 and pv2, which were coerced to char *, now are just compared byte-by-byte because they are coerced to void *. Is that what happens with memcmp

Pretty much. The main differences differences between memcmp and strcmp are:

  1. strcmp will stop once it sees a NULL (i.e., '\0'), and Perl allows scalars to have embedded NULLs
  2. memcmp often runs just a little bit faster than strcmp

But aside from that you're going to get the same results.

like image 24
Max Lybbert Avatar answered Nov 18 '22 14:11

Max Lybbert