 

Python 3.3 C-API and UTF-8 Strings

So, Python 3.3 introduced PEP 393, which changes the implementation of Python str objects (Unicode objects) so that internally they are represented in a memory-efficient way while still allowing random access to each character. This is done by scanning the string for the largest Unicode code point and choosing a storage width based on it. This way ASCII and Latin-1 strings require only 1 byte per character, strings whose code points fit in the Basic Multilingual Plane (which covers most East Asian Chinese/Japanese/Korean text) require 2 bytes per character, and strings containing code points beyond the BMP require 4 bytes per character. This is much more memory-efficient than earlier implementations, which always used 2 or 4 bytes per character regardless of the actual code points in the string.

So I understand the purpose of PEP 393, but I'm really confused about how I'm supposed to go about creating a Python Unicode object from a UTF-8 encoded C string. Creating a Python str from a UTF-8 C string is an extremely common requirement. In fact, it's so common that the old (pre Python 3.3) C-API had a function entirely for that purpose:

PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

This function takes a UTF-8 C-string along with a size parameter, and creates a Python Unicode object from that string.

However, this function is now deprecated after PEP 393. But I'm still not really sure exactly what is supposed to replace it.

At first I thought we could use the new PyUnicode_FromKindAndData function:

PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, Py_ssize_t size)

This new function takes a "kind" parameter, which indicates whether the input buffer is UCS1, UCS2, or UCS4. Based on that parameter it creates a new compact Python Unicode object. So, at first I thought that saying PyUnicode_FromKindAndData(PyUnicode_1BYTE_KIND, buf, size) would basically be equivalent to the old PyUnicode_FromStringAndSize. I was thinking that PyUnicode_1BYTE_KIND means that Python should assume that the input buffer is a UTF-8 string. But this doesn't seem to be the case. UCS1 is not the same as UTF-8, and PyUnicode_1BYTE_KIND seems to just indicate that the input buffer has 1 byte per character - which is not the same as UTF-8, which is variable length and can have anywhere from 1 to 4 bytes per character.

So then, how can we create a Python Unicode object from a UTF-8 C string using the new PEP 393 API?

After reading the docs and PEP 393 itself, it seems to me that the only way to do this is to manually compute the maximum character yourself, and then call PyUnicode_New. Then iterate over the newly created string buffer, and manually convert each code point in the UTF-8 C string into the correct encoding based on the maxchar, copying each character using PyUnicode_WRITE in a loop.


... except I'm a bit surprised that the API would actually require all of this manual work - including all the conversions from UTF-8 to UTF-32 or UTF-16, taking into account things like surrogate pairs and all that. Basically doing all these conversions manually is a lot of effort, and I'm surprised the Python C-API doesn't expose functions to do this in an easier way. I mean, obviously such code exists in the Python source, since the old deprecated PyUnicode_FromStringAndSize was doing exactly that. It was converting UTF-8 to UTF-16 or UTF-32 (depending on the platform). But now with PEP 393, it seems all of this has to be done manually.

So am I missing something? Is there an easier way to create a Python Unicode object using a UTF-8 C string as input? Or is it really necessary to do this all manually if we wish to avoid using the deprecated functions?

Siler asked Mar 19 '16


1 Answer

PEP 393 does not specify any new API for converting from UTF-* encodings to Unicode. The old APIs still apply there.

If you do not need custom error handling, these two are still usable:

PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
PyObject* PyUnicode_FromString(const char*u)

Note that for the first one, u must not be NULL: only the NULL form, which creates an uninitialized object, has been deprecated, not the function itself.

If you need error handling, or surrogate escapes, use PyUnicode_DecodeUTF8:

PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

All of these use PyUnicode_DecodeUTF8Stateful internally, which since PEP 393 directly creates canonical (compact) PyUnicodeObjects.


As for getting the UTF-8 representation of a PyUnicodeObject as char *, use either of

char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
char* PyUnicode_AsUTF8(PyObject *unicode)

This representation is cached in the PyUnicodeObject itself and remains valid for as long as that object is alive. These forms are especially useful when all the characters are ASCII: since ASCII is a subset of UTF-8, the returned pointer can then point directly at the existing character data, with no extra copy.
