 

Python 3.3 C-API and UTF-8 Strings

So, Python 3.3 introduced PEP 393, which changes the implementation of Python str objects (Unicode objects) so that internally they are represented in a memory-efficient way while still allowing random access to each character. This is done by scanning the string for the largest Unicode code point and choosing a storage width based on it. This way ASCII and Latin-1 strings require only 1 byte per character, strings whose code points fit in the Basic Multilingual Plane (which covers most East Asian Chinese/Japanese/Korean text) require 2 bytes per character, and strings containing code points beyond the BMP require 4 bytes per character. This is much more memory-efficient than earlier implementations, which always used 2 or 4 bytes per character regardless of the actual code points in the string.

So I understand the purpose of PEP 393, but I'm really confused about how I'm supposed to go about creating a Python Unicode object from a UTF-8 encoded C string. Creating a Python str from a UTF-8 C string is an extremely common requirement. In fact, it's so common that the old (pre Python 3.3) C-API had a function entirely for that purpose:

PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

This function takes a UTF-8 C-string along with a size parameter, and creates a Python Unicode object from that string.

However, this function is now deprecated after PEP 393. But I'm still not really sure exactly what is supposed to replace it.

At first I thought we could use the new PyUnicode_FromKindAndData function:

PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, Py_ssize_t size)

This new function takes a "kind" parameter, which indicates whether the input buffer is UCS1, UCS2, or UCS4. Based on that parameter it creates a new compact Python Unicode object. So, at first I thought that saying PyUnicode_FromKindAndData(PyUnicode_1BYTE_KIND, buf, size) would basically be equivalent to the old PyUnicode_FromStringAndSize. I was thinking that PyUnicode_1BYTE_KIND means that Python should assume that the input buffer is a UTF-8 string. But this doesn't seem to be the case. UCS1 is not the same as UTF-8, and PyUnicode_1BYTE_KIND seems to just indicate that the input buffer has 1 byte per character - which is not the same as UTF-8, which is variable length and can have anywhere from 1 to 4 bytes per character.

So then, how can we create a Python Unicode object from a UTF-8 C string using the new PEP 393 API?

After reading the docs and PEP 393 itself, it seems to me that the only way to do this is to manually compute the maximum character yourself, and then call PyUnicode_New. Then iterate over the newly created string buffer, and manually convert each code point in the UTF-8 C string into the correct encoding based on the maxchar, copying each character using PyUnicode_WRITE in a loop.


... except I'm a bit surprised that the API would actually require all of this manual work - including all the conversions from UTF-8 to UTF-32 or UTF-16, taking into account things like surrogate pairs and all that. Basically doing all these conversions manually is a lot of effort, and I'm surprised the Python C-API doesn't expose functions to do this in an easier way. I mean, obviously such code exists in the Python source, since the old deprecated PyUnicode_FromStringAndSize was doing exactly that. It was converting UTF-8 to UTF-16 or UTF-32 (depending on the platform). But now with PEP 393, it seems all of this has to be done manually.

So am I missing something? Is there an easier way to create a Python Unicode object using a UTF-8 C string as input? Or is it really necessary to do this all manually if we wish to avoid using the deprecated functions?

Siler asked Mar 19 '16


1 Answer

PEP 393 does not specify any new API for converting from UTF-* encodings to Unicode. The old APIs still apply there.

If you do not need custom error handling, these two are still usable:

PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
PyObject* PyUnicode_FromString(const char*u)

Note that for the first one, u must not be NULL: only the NULL form, which creates an uninitialized object, has been deprecated, not the function itself.

If you need error handling, or surrogate escapes, use PyUnicode_DecodeUTF8:

PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

All of these use PyUnicode_DecodeUTF8Stateful internally, which since PEP 393 directly creates canonical (compact) PyUnicodeObjects.


As for getting the UTF-8 representation of a PyUnicodeObject as char *, use either of

char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
char* PyUnicode_AsUTF8(PyObject *unicode)

This representation is cached in the PyUnicodeObject itself and remains valid for as long as that object is alive. These forms are especially useful when all the characters are ASCII: since ASCII is a subset of UTF-8, the returned pointer can then point directly at the existing character data, with no extra copy.
