The code here is in Python, but the behavior should be the same in C/C++ using locale.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
>>> locale.strxfrm("Gène").startswith(locale.strxfrm("Gè"))
False
I know it is not supposed to be used that way, but I'm wondering what is going on...
Context:
I have an array of strxfrm-transformed strings and a normal input text. I want to know which of the transformed strings started with that text before transformation. Is it doable at all? How?
Bonus Question:
Can we get the per-locale list of equivalent letters? Can we check for equivalent strings?
What I mean is:
In "de_DE.UTF8", can I get something like
locale.strxfrm("Wissen").startswith(locale.strxfrm("Wiß"))
returning True? After all, "ß" and "ss" are equivalent in sorting (unless that is the only difference):
>>> locale.strxfrm("Wiessen") < locale.strxfrm("Wießen") < locale.strxfrm("Wiessen0")
True
Same for "œ" and "oe" in French.
EDIT: Regarding the bonus, I saw "Python locale-aware string comparison", but the answer relies on third-party libraries, so I wrote a hacked workaround function:
def isEquivalent(str1, str2):
    return (locale.strxfrm(str2[:-1]) < locale.strxfrm(str1) <= locale.strxfrm(str2) < locale.strxfrm(str1 + "0")
            or
            locale.strxfrm(str1[:-1]) < locale.strxfrm(str2) <= locale.strxfrm(str1) < locale.strxfrm(str2 + "0"))
A very interesting question!
This answer is not canonical; I think glibc-dev would be the best forum for that.
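The only requirement for strxfrm is that comparing transformed strings gives the same ordering as strcoll, i.e. the two results agree in sign:
sign(strcmp(strxfrm(a), strxfrm(b))) == sign(strcoll(a, b))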
What strxfrm allows is to export the relative order of things to another (dumber) system, for example, to maintain a secondary index in a database table.
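As a minimal sketch of that idea (the word list is made up for illustration, and the snippet assumes whatever LC_COLLATE setting is already active in the process), sorting by strxfrm keys should give the same order as sorting with strcoll as the comparator:

```python
import functools
import locale

# Hypothetical sample data; any list of strings would do.
words = ["pear", "Apple", "banana", "apple", "Zebra"]

# Sort using precomputed keys: a "dumb" system only has to compare the keys.
by_key = sorted(words, key=locale.strxfrm)

# Sort using the locale-aware comparison function directly.
by_cmp = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# Both orderings are expected to agree, whatever the active locale is.
print(by_key == by_cmp)
```

The keys themselves are opaque; only their relative order is meaningful, which is why they can be stored in an index and compared there without any locale support.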
Let's examine Python3 (Python3.9, OSX, composed normal form):
>>> import unicodedata
>>> locale.strxfrm(unicodedata.normalize("NFC", "Gène"))
'Jëqh\x01Jëqh'
>>> locale.strxfrm(unicodedata.normalize("NFC", "Gè"))
'Jë\x01Jë'
If you split each output at the <SOH> byte ('\x01'), every part of the shorter result is a prefix of the corresponding part of the longer one.
I don't know the significance of the output being essentially repeated on both sides of the separator character. 🤔
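If that observation holds on your platform, a prefix check could be sketched like this. The helper names primary_part and transformed_startswith are made up for illustration, and the <SOH> layout is an assumption taken from the macOS outputs above, not a documented guarantee; glibc may lay its keys out differently:

```python
# Assumption: the transformed string has the form "<part>\x01<part>",
# as seen in the outputs above. This is not a documented format.
SEPARATOR = "\x01"


def primary_part(transformed: str) -> str:
    """Return the part of a strxfrm() result before the first <SOH>."""
    return transformed.split(SEPARATOR, 1)[0]


def transformed_startswith(transformed: str, transformed_prefix: str) -> bool:
    """Compare only the parts before <SOH> instead of the whole keys."""
    return primary_part(transformed).startswith(primary_part(transformed_prefix))


# The literal values below are copied from the outputs shown above.
gene = "Jëqh\x01Jëqh"  # strxfrm("Gène")
ge = "Jë\x01Jë"        # strxfrm("Gè")

print(transformed_startswith(gene, ge))  # True
print(gene.startswith(ge))               # False, as in the question
```

The sketch works on already-transformed strings, so the input text only needs to be passed through strxfrm once before it is checked against the stored array.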
With NFD input, Python 3 appears to follow the same semantics but produces different output, which I guess only underlines how important it is to normalise your text 😼
>>> locale.strxfrm(unicodedata.normalize("NFD", "Gène"))
'Jhăqh\x01JhЃqh'
>>> locale.strxfrm(unicodedata.normalize("NFD", "Gè"))
'Jhă\x01JhЃ'
Other scripts have funkier output, here's Japanese in Japanese locale:
>>> locale.strxfrm(unicodedata.normalize("NFC", "村上 春樹"))
'ăă#ăă\x01桔伍#木欼'
>>> locale.strxfrm(unicodedata.normalize("NFC", "村上春樹"))
'ăăăă\x01桔伍木欼'
>>> locale.strxfrm(unicodedata.normalize("NFC", "村上"))
'ăă\x01桔伍'
>>> 'ăăăă\x01桔伍木欼' > 'ăă#ăă\x01桔伍#木欼' > 'ăă\x01桔伍'
True
Python 2 uses a different format in which the content is also repeated, but it's unclear how to detect the separator. So let's not use Python 2; it's already EOL 😅
>>> locale.strxfrm(unicodedata.normalize("NFC", u"Gène").encode("utf-8"))
'0019003Z001`001W00000019003Z001`001W'
>>> locale.strxfrm(unicodedata.normalize("NFC", u"Gè").encode("utf-8"))
'0019003Z00000019003Z'
JavaScript has the Intl module, which provides collation (ordering) via new Intl.Collator(...).compare(), but as far as I know it does not expose an equivalent of strxfrm. I wonder if there's some fundamental difficulty with that. I wish such a function were available to build e.g. custom IndexedDB indices, but alas! 🤷‍♂️