Why isn't locale.strxfrm("Gè") a prefix of locale.strxfrm("Gène") with locale "fr_FR.UTF-8"?

The code here is in Python, but the behavior should be the same in C/C++ using locale.

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
>>> locale.strxfrm("Gène").startswith(locale.strxfrm("Gè"))
False

I know it is not supposed to be used that way, but I'm wondering what is going on...

Context:
I have an array of strxfrm-transformed strings and a plain (untransformed) input text. I want to know which of the transformed strings come from originals that started with that text. Is it doable at all? How?

Bonus Question:

Can we get the per-locale list of equivalent letters? Can we check for equivalent strings?

What I mean is:
In "de_DE.UTF8", can I get something like

locale.strxfrm("Wissen").startswith(locale.strxfrm("Wiß")) 

returning True?

Since "ß" and "ss" are equivalent in sorting (unless it's the only difference):

>>> locale.strxfrm("Wiessen") < locale.strxfrm("Wießen") < locale.strxfrm("Wiessen0")
True

Same for "œ" and "oe" in French.

EDIT: Regarding the bonus, I saw "Python locale-aware string comparison", but the answer there relies on third-party libraries, so I came up with a hacked workaround function:

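def isEquivalent(str1, str2):
    # Workaround: bracket one string between truncated/extended forms of
    # the other in collation order, then try again with the roles swapped.
    return (
        locale.strxfrm(str2[:-1]) < locale.strxfrm(str1) <= locale.strxfrm(str2) < locale.strxfrm(str1 + "0")
        or locale.strxfrm(str1[:-1]) < locale.strxfrm(str2) <= locale.strxfrm(str1) < locale.strxfrm(str2 + "0")
    )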

Bastien Jacquet asked Jan 13 '15 06:01


1 Answer

A very interesting question! This answer is not canonical; I think glibc-dev would be the best forum for a definitive one.

TL;DR

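The only requirement for strxfrm is that comparing two transformed strings with strcmp gives the same ordering as comparing the originals with strcoll, i.e. the two results agree in sign (not necessarily in value):

sign(strcmp(strxfrm(a), strxfrm(b))) == sign(strcoll(a, b))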

What strxfrm allows is to export the relative order of things to another (dumber) system, for example, to maintain a secondary index in a database table.
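
As a rough illustration of that use, the sketch below stores strxfrm keys next to the original strings and orders them with plain string comparison, the way a collation-unaware system would. The helper names build_index and collated are made up for this example, and the resulting order depends on whichever LC_COLLATE locale is active.

```python
import functools
import locale


def build_index(words):
    """Return (sort_key, word) pairs ordered by plain comparison of the keys."""
    pairs = [(locale.strxfrm(w), w) for w in words]
    # The "dumb" consumer only ever compares the opaque keys, never the words.
    return sorted(pairs, key=lambda pair: pair[0])


def collated(words):
    """Return the words ordered directly with strcoll, for comparison."""
    return sorted(words, key=functools.cmp_to_key(locale.strcoll))


if __name__ == "__main__":
    # Assumption: fr_FR.UTF-8 is installed; otherwise keep the current locale.
    try:
        locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
    except locale.Error:
        pass
    words = ["Gz", "Ga", "Gy", "Gb"]
    print([w for _, w in build_index(words)])
    print(collated(words))
```

Both lists should come out in the same order, since that agreement is exactly what strxfrm is meant to provide.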

Let's test it

Let's examine Python 3 (Python 3.9, macOS, NFC composed normal form):

>>> import locale, unicodedata
>>> locale.strxfrm(unicodedata.normalize("NFC", "Gène"))
'Jëqh\x01Jëqh'
>>> locale.strxfrm(unicodedata.normalize("NFC", "Gè"))
'Jë\x01Jë'

If you split each output at the <SOH> byte (\x01), every segment of the "Gè" result is a prefix of the corresponding segment of the "Gène" result.
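
A minimal sketch of that segment-wise check, assuming the \x01-separated layout seen in the macOS output above. Other platforms (glibc, for instance) produce different, undocumented formats, so this is not portable. The function name is_segmentwise_prefix is hypothetical, and the keys are the literal outputs quoted above, not fresh strxfrm calls.

```python
def is_segmentwise_prefix(full_key, prefix_key, sep="\x01"):
    """Check that each sep-delimited segment of prefix_key starts the matching segment of full_key."""
    full_parts = full_key.split(sep)
    prefix_parts = prefix_key.split(sep)
    # Keys with a different number of segments cannot be compared this way.
    if len(full_parts) != len(prefix_parts):
        return False
    return all(f.startswith(p) for f, p in zip(full_parts, prefix_parts))


# Literal keys copied from the session above ("Gène" and "Gè", NFC).
gene_key = "Jëqh\x01Jëqh"
ge_key = "Jë\x01Jë"

print(gene_key.startswith(ge_key))              # whole-key prefix test
print(is_segmentwise_prefix(gene_key, ge_key))  # segment-by-segment test
```

With these two keys the whole-key test prints False and the segment-by-segment test prints True.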

I don't know the significance of the output being essentially repeated on both sides of the separator character. 🤔

With NFD input, Python 3 appears to follow the same semantics but produces different output, which I guess only underlines how important it is to normalise your text 😼

>>> locale.strxfrm(unicodedata.normalize("NFD", "Gène"))
'Jhăqh\x01JhЃqh'
>>> locale.strxfrm(unicodedata.normalize("NFD", "Gè"))
'Jhă\x01JhЃ'
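
The two forms already differ before strxfrm ever sees them. As a quick check using only the standard library, the snippet below shows that NFC keeps "è" as a single code point while NFD splits it into "e" plus a combining grave accent, so the transformed keys are built from different input:

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "Gè")
nfd = unicodedata.normalize("NFD", "Gè")

# NFC: 'G' + U+00E8; NFD: 'G' + 'e' + U+0300 (combining grave accent).
print([hex(ord(ch)) for ch in nfc])
print([hex(ord(ch)) for ch in nfd])

# The strings render identically but are not equal until normalised the same way.
print(nfc == nfd)
print(unicodedata.normalize("NFC", nfd) == nfc)
```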

Other scripts have funkier output. Here's Japanese in a Japanese locale:

>>> locale.strxfrm(unicodedata.normalize("NFC", "村上  春樹"))
'ăă#ăă\x01桔伍#木欼'
>>> locale.strxfrm(unicodedata.normalize("NFC", "村上春樹"))
'ăăăă\x01桔伍木欼'
>>> locale.strxfrm(unicodedata.normalize("NFC", "村上"))
'ăă\x01桔伍'
>>> 'ăăăă\x01桔伍木欼' > 'ăă#ăă\x01桔伍#木欼' > 'ăă\x01桔伍'
True

Python 2 has a different format in which the content is also repeated, but it's unclear how to detect the separator. So let's not use Python 2; it's already EOL 😅

>>> locale.strxfrm(unicodedata.normalize("NFC", u"Gène").encode("utf-8"))
'0019003Z001`001W00000019003Z001`001W'
>>> locale.strxfrm(unicodedata.normalize("NFC", u"Gè").encode("utf-8"))
'0019003Z00000019003Z'

JavaScript has the Intl object, which provides collation (ordering) via new Intl.Collator(...).compare(), but as far as I know it does not expose an equivalent of strxfrm. I wonder if there's some fundamental difficulty with that. I wish such a function were available to build e.g. custom IndexedDB indices, but alas! 🤷‍♂️

Dima Tisnek answered Sep 22 '22 05:09