Python >= 3.3 Internal String Representation

I was looking into how Python represents strings after PEP 393, and I don't understand the difference between PyASCIIObject and PyCompactUnicodeObject.

My understanding is that strings are represented with the following structures:

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

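typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the terminating \0 */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr */
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;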

Correct me if I am wrong, but my understanding is that PyASCIIObject is used for strings with ASCII characters only, PyCompactUnicodeObject uses the PyASCIIObject structure and it is used for strings with at least one non-ASCII character, and PyUnicodeObject is used for legacy functions. Is that correct?

Also, why does PyASCIIObject use wchar_t? Isn't a char enough to represent ASCII strings? In addition, if PyASCIIObject already has a wchar_t pointer, why does PyCompactUnicodeObject also have a char pointer? My understanding is that both pointers point to the same location, but why would you include both?

Alberto O. asked Jun 16 '20 05:06


1 Answer

PEP 393 is really the best reference for your questions, though the C-API docs are sometimes needed too. Let's address your questions one by one:

  1. You have the types right. But there is one non-obvious wrinkle: When you're using either of the "compact" types (either PyASCIIObject or PyCompactUnicodeObject), the structure itself is just a header. The string's actual data is stored immediately after the structure in memory. The encoding used by the data is described by the kind field, and will depend on the largest character value in the string.

  2. The wstr and utf8 pointers in the first two structures are places where a transformed representation can be stored if one is requested by C code. For an ASCII string (using the PyASCIIObject), no cache pointer is needed for UTF-8 data, since the ASCII data itself is UTF-8 compatible. The wide character cache is only used by deprecated functions.

    The two cache pointers will never point to the same place, since their types are not directly compatible. For compact strings, they are only allocated when a function that needs a UTF-8 buffer (e.g. PyUnicode_AsUTF8AndSize) or a Py_UNICODE buffer (e.g. the deprecated PyUnicode_AS_UNICODE) gets called.

    For strings created with the deprecated Py_UNICODE based APIs, the wstr pointer has an extra use. It points to the only version of the string data until the PyUnicode_READY macro is called on the string. The first time the string is readied, a new data buffer will be created, and the characters will be stored in it, using the most compact encoding possible among Latin-1, UCS-2 and UCS-4. The wstr buffer will be kept, as it might be needed later by other deprecated API functions that want to look up a Py_UNICODE string.
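The layout described in the points above can be observed from Python itself, without touching the C structures. The following is a minimal sketch, assuming CPython 3.3 or later and assuming that sys.getsizeof reports the header plus the inline character data for a freshly created string (an implementation detail, not a language guarantee). The helper names predicted_kind, measured_width and header_size are made up for this illustration and are not part of any Python API.

```python
import sys


def predicted_kind(s):
    """Bytes per code point that PEP 393 would choose for s (1, 2 or 4)."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1   # Latin-1 range; pure ASCII (< 0x80) also uses 1 byte
    if top < 0x10000:
        return 2   # UCS-2
    return 4       # UCS-4


def measured_width(ch, n=100):
    """Bytes added by one more copy of ch, as reported by sys.getsizeof."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


def header_size(ch, n=100):
    """Object size minus the character data (n characters plus a terminator)."""
    return sys.getsizeof(ch * n) - (n + 1) * measured_width(ch, n)


if __name__ == "__main__":
    # ASCII, Latin-1, BMP (euro sign) and non-BMP (emoji) sample characters.
    for ch in ("a", "\xe9", "\u20ac", "\U0001f600"):
        print(
            f"U+{ord(ch):04X}",
            f"predicted={predicted_kind(ch)}",
            f"measured={measured_width(ch)}",
            f"header={header_size(ch)}",
        )
```

On CPython the measured width should match the predicted kind, and the header of an ASCII-only string should come out smaller than the header of a string containing non-ASCII characters, which reflects PyASCIIObject versus PyCompactUnicodeObject. The absolute header sizes differ between Python versions and platforms, so none are quoted here.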

It is interesting that you're asking about CPython's internal string representations right now, as there's an ongoing discussion about whether deprecated string API functions and implementation details like the wchar_t * pointer can be removed in an upcoming version of Python. It looks like it might happen for Python 3.11.0 (which is expected to be released in 2022). Plans could still change before then, especially if the impact on code in the wild is more severe than expected.

Blckknght answered Sep 27 '22 19:09