Unicode character not in range when calling locale.strxfrm

Tags: python, python-3.x, unicode, locale

I am seeing odd behavior when using the locale library with Unicode input. Below is a minimal working example:

>>> x = '\U0010fefd'
>>> ord(x)
1113853
>>> ord('\U0010fefd') == 0X10fefd
True
>>> ord(x) <= 0X10ffff
True
>>> import locale
>>> locale.strxfrm(x)
'\U0010fefd'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+110000 is not in range [U+0000; U+10ffff]

I have seen this on Python 3.3, 3.4 and 3.5. I do not get an error on Python 2.7.

As far as I can see, my input is within the valid Unicode range, so it seems that something internal to strxfrm, when the 'en_US.UTF-8' locale is active, is moving the input out of range.

I am running Mac OS X, and this behavior may be related to http://bugs.python.org/issue23195 , but I was under the impression that bug would only manifest as incorrect results, not as a raised exception. I cannot reproduce it on my SLES 11 machine, and others confirm they cannot reproduce it on Ubuntu, CentOS, or Windows. It may be instructive to hear about other operating systems in the comments.

Can someone explain what may be happening here under the hood?

asked Nov 01 '15 05:11

SethMMorton

1 Answer

In Python 3.x, the function locale.strxfrm(s) internally uses the POSIX C function wcsxfrm(), whose behavior depends on the current LC_COLLATE setting. The POSIX standard defines the transformation in this way:

The transformation shall be such that if wcscmp() is applied to two transformed wide strings, it shall return a value greater than, equal to, or less than 0, corresponding to the result of wcscoll() applied to the same two original wide-character strings.

This definition can be implemented in multiple ways, and doesn't even require that the resulting string is readable.
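
The property quoted above can be checked from Python. The following is a minimal sketch, not part of the original answer: it uses the "C" locale only because that locale exists on every platform, and the word list is made up for illustration. It sorts the same strings once by their transformed keys and once by calling the collation function directly, and the two orderings should agree.

```python
import functools
import locale

# Assumption: the "C" locale is chosen only because it is always available;
# any installed locale name could be substituted here.
locale.setlocale(locale.LC_COLLATE, "C")

# Hypothetical sample data.
words = ["pear", "Apple", "banana", "apple", "Zebra"]

# Sort by comparing transformed keys (the purpose of strxfrm).
by_key = sorted(words, key=locale.strxfrm)

# Sort by applying the collation function to the original strings.
by_coll = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# POSIX requires both approaches to yield the same ordering.
print(by_key == by_coll)
```

Nothing in this contract says what the transformed keys must look like, only how they compare with each other.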

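I wrote a small C program to demonstrate how it works:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define BUF_LEN 10

/* Print each wide character of s as a hexadecimal value. */
static void dump(const char *label, const wchar_t *s) {
  printf("%s", label);
  for (size_t i = 0; s[i]; i++)
    printf(" 0x%lx", (unsigned long)s[i]);
  printf("\n");
}

int main(void) {
  wchar_t buf[BUF_LEN];
  const wchar_t *in = L"\x10fefd";  /* assumes a 32-bit wchar_t */

  if (!setlocale(LC_COLLATE, "en_US.UTF-8")) {
    fprintf(stderr, "locale en_US.UTF-8 is not available\n");
    return 1;
  }

  dump("in :", in);

  /* wcsxfrm() returns the length it needs; buf is valid only if it fits. */
  if (wcsxfrm(buf, in, BUF_LEN) >= BUF_LEN) {
    fprintf(stderr, "transformed string does not fit in buf\n");
    return 1;
  }

  dump("out:", buf);
  return 0;
}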

It prints the string before and after the transformation.

Running it on Linux (Debian Jessie) this is the result:

in : 0x10fefd
out: 0x1 0x1 0x1 0x1 0x552

while running it on OSX (10.11.1) the result is:

in : 0x10fefd
out: 0x103 0x1 0x110000

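You can see that the output of wcsxfrm() on OSX contains the value 0x110000, which is one past the largest Unicode code point (U+10FFFF) and therefore not permitted in a Python string. Python raises the ValueError when it tries to convert the transformed wide string back into a str, so this is the source of the error.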
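
The same range check can be seen in pure Python. This is only an illustration that mimics the conversion with chr(); it is not the code path CPython actually uses inside locale.strxfrm(), and the helper name to_str is hypothetical. The two lists are the wcsxfrm() outputs shown above.

```python
def to_str(code_units):
    """Build a str from integer code points (hypothetical helper)."""
    return "".join(chr(u) for u in code_units)

# wcsxfrm() output values copied from the two runs shown above.
linux_out = [0x1, 0x1, 0x1, 0x1, 0x552]
osx_out = [0x103, 0x1, 0x110000]

# Every value in the Linux output is a valid code point.
print(repr(to_str(linux_out)))

# The OSX output contains 0x110000, which chr() refuses with a ValueError.
try:
    to_str(osx_out)
except ValueError as exc:
    print("rejected:", exc)
```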

On Python 2.7 the error is not raised because its locale.strxfrm() implementation is based on the C strxfrm() function, which works on byte strings rather than wide characters.

UPDATE:

Investigating further, I see that the LC_COLLATE definition for en_US.UTF-8 on OSX is a symbolic link to the la_LN.US-ASCII definition.

$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Oct  1 14:24 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

I found the actual definition in the sources from Apple. The content of file la_LN.US-ASCII.src is the following:

order \
    \x00;...;\xff

2nd UPDATE:

I've further tested the wcsxfrm() function on OSX. Using the la_LN.US-ASCII collation, given a sequence of wide characters C1..Cn as input, the output is a string of this form:

W1..Wn \x01 U1..Un

where

Wx = 0x103 if Cx > 0xFF else Cx+0x3
Ux = Cx+0x103 if Cx > 0xFF else Cx+0x3

Using this algorithm, \x10fefd becomes 0x103 0x1 0x110000.
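
The rule above can be written as a short Python model. This is a sketch of the behavior observed in the tests, not Apple's actual implementation, and the function name is hypothetical.

```python
def osx_ascii_collate_model(code_points):
    """Model the observed OSX wcsxfrm() output for la_LN.US-ASCII.

    Takes and returns lists of integers (wide-character values).
    """
    # W1..Wn: everything above 0xFF collapses to the single weight 0x103.
    primary = [0x103 if c > 0xFF else c + 0x3 for c in code_points]
    # U1..Un: values above 0xFF are shifted by 0x103, the rest by 0x3.
    secondary = [c + 0x103 if c > 0xFF else c + 0x3 for c in code_points]
    # The two halves are separated by a single 0x1.
    return primary + [0x1] + secondary

print([hex(v) for v in osx_ascii_collate_model([0x10FEFD])])
```

Under this model, any input code point above 0x10FEFC (that is, 0x10FFFF - 0x103) produces an output value beyond U+10FFFF, so the character in the question is the smallest one that triggers the error.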

I've checked, and every UTF-8 locale uses this collation definition on OSX, so I'm inclined to say that collation support for UTF-8 on Apple systems is broken. The resulting ordering is almost the same as the one obtained with a plain byte comparison, with the bonus of being able to produce illegal Unicode characters.

answered Oct 16 '22 14:10

mnencia
