
String ordering in Lua

I'm reading Programming in Lua, 1st edition (yup, I know it's a bit outdated), and in section 3.2 (about relational operators), the author says:

For instance, with the European Latin-1 locale, we have "acai" < "açaí" < "acorde".

I don't get it. For me, it's OK to have "acai" < "açaí", but why is "açaí" < "acorde"?

AFAIK (and Wikipedia seems to confirm), "c" < "ç", or am I wrong?

Valdir Stumm Junior asked Apr 14 '14 12:04



2 Answers

In the third edition of PiL, this statement has been modified:

For instance, with a Portuguese Latin-1 locale, we have "acai"<"açaí"<"acorde".

So the locale needs to be set to Portuguese Latin-1 accordingly:

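-- Default "C" locale: strings are compared byte by byte
print("acai" < "açaí")
print("açaí" < "acorde")

-- Switch to a Portuguese locale; prints the locale name, or nil if it is unavailable
print(os.setlocale("pt_PT"))

-- Same comparisons, now using the collation rules of the new locale
print("acai" < "açaí")
print("açaí" < "acorde")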

On ideone, the result is:

true
false
pt_PT.iso88591
false
true

Note, however, that under this locale "acai" < "açaí" now evaluates to false, which differs from the ordering given in the book.
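If only the comparison order should change, `os.setlocale` also accepts a category argument, and it returns nil when the requested locale is not installed. The following is a minimal sketch of that idea; `with_collate` is a hypothetical helper name, and the result of the `pt_PT` comparison depends on the locales available on the machine and on the encoding of the source file.

```lua
-- Hypothetical helper: run fn with only the "collate" category changed,
-- then restore the previous setting.
local function with_collate(locale, fn)
  local previous = os.setlocale(nil, "collate")   -- nil queries the current locale
  if not os.setlocale(locale, "collate") then
    return nil, "locale not available: " .. locale
  end
  local ok, result = pcall(fn)
  os.setlocale(previous, "collate")               -- always restore
  if not ok then error(result, 0) end
  return result
end

-- Prints the comparison result, or nil plus a message if pt_PT is missing.
print(with_collate("pt_PT", function() return "açaí" < "acorde" end))
```

Restoring the previous setting keeps the rest of the program unaffected by the temporary change.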

Yu Hao answered Sep 23 '22 21:09


You reference a code page, which maps codepoints to characters. Certainly codepoints, being a finite set of non-negative integers, are well-ordered, distinct entities. However, that is not what characters are about.
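To see what a pure codepoint ordering gives for the words in the question, here is a small sketch that compares strings strictly by byte value. `byte_less` is a hypothetical helper, and the accented word is written with decimal escapes for its Latin-1 codes (ç = 231, í = 237) so that the encoding of the source file does not matter.

```lua
local acai_plain  = "acai"
local acai_accent = "a\231a\237"   -- "açaí" in Latin-1
local acorde      = "acorde"

-- Hypothetical helper: compare two strings strictly by byte value.
local function byte_less(a, b)
  for i = 1, math.min(#a, #b) do
    local x, y = a:byte(i), b:byte(i)
    if x ~= y then return x < y end
  end
  return #a < #b                   -- a proper prefix sorts first
end

print(("c"):byte(), ("\231"):byte())       -- 99  231
print(byte_less(acai_plain, acai_accent))  -- true
print(byte_less(acai_accent, acorde))      -- false
```

By codepoint alone, "c" < "ç" does hold, which is why "açaí" cannot come before "acorde" under that ordering.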

Characters have a collation order, which is a weak ordering rather than a strict total one: characters can compare as "equal" without being the same. Collation is a user-facing convention that varies by locale (and over time).

Strings are even more complicated because some character sets (e.g. Unicode) have combining characters. That allows a "character" to be represented either as a single character or as a base character followed by combining characters, for example "ä" vs "a¨". Since both represent the same conceptual character, they should be considered even more equal than "ä" vs "a".
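In Lua this shows up directly, because a Lua string is a plain byte sequence and the standard library does no Unicode normalization. A small sketch, assuming UTF-8 encoding and writing the bytes as decimal escapes:

```lua
local precomposed = "\195\164"    -- U+00E4 "ä" as a single code point (bytes C3 A4)
local decomposed  = "a\204\136"   -- "a" followed by U+0308 combining diaeresis (bytes CC 88)

print(#precomposed, #decomposed)  -- 2  3
print(precomposed == decomposed)  -- false
```

Both strings render as the same character, yet `==` treats them as different because their bytes differ.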

In Spanish, "ch", "rr" and "ll" used to be letters in the alphabet and words were ordered accordingly; now they are not, but "ñ" still is.

Similarly, in the past it was not uncommon for English-speakers to sort surnames beginning with "Mc" and "Mac" after others beginning with "M".

Software libraries have to deal with such things because that's what users want. Thankfully, some of the older conventions have fallen from use.


So, a locale could very well have collation rules that result in "acai" < "açaí" < "acorde" if "c" has the same sort order as "ç" but "i" comes before "í". This particular case seems strange, but since such rules are possible in general, our code has to allow for them.
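A rule of that kind can be imitated in plain Lua with a two-level comparison. This is only a sketch, not how any real locale is implemented: `base_of`, `primary_key` and `collate_less` are hypothetical names, the table covers just the two accented Latin-1 letters from the example, and the default "C" locale is assumed so that `<` compares bytes.

```lua
-- Hypothetical table: Latin-1 accented bytes mapped to their base letters.
local base_of = { ["\231"] = "c", ["\237"] = "i" }   -- ç -> c, í -> i

-- Primary key: the string with accents stripped.
local function primary_key(s)
  return (s:gsub(".", base_of))   -- bytes missing from the table are kept as-is
end

-- Compare base letters first; use the accents only to break ties.
local function collate_less(a, b)
  local ka, kb = primary_key(a), primary_key(b)
  if ka ~= kb then return ka < kb end
  return a < b
end

local acai, acai_accent, acorde = "acai", "a\231a\237", "acorde"
print(collate_less(acai, acai_accent))    -- true
print(collate_less(acai_accent, acorde))  -- true
```

With this comparator, `table.sort(words, collate_less)` would put the three words in the order given in the book.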

Tom Blodget answered Sep 25 '22 21:09