Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is internal representation of string in Python 3.x

In Python 3.x, a string consists of items of Unicode ordinal. (See the quotation from the language reference below.) What is the internal representation of Unicode string? Is it UTF-16?

The items of a string object are Unicode code units. A Unicode code unit is represented by a string object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items.

like image 471
thebat Avatar asked Dec 03 '09 06:12

thebat


People also ask

What is internal representation in Python?

Briefly, the internal representation in a unicode object is an array of 16-bit unsigned integers, or an array of 32-bit unsigned integers (using only 21 bits).

What is string representation in Python?

Strings are Arrays Like many other popular programming languages, strings in Python are arrays of bytes representing unicode characters. However, Python does not have a character data type, a single character is simply a string with a length of 1.

How are string stored internally in Python 3?

Answer: How are strings stored internally in Python 3? They are stored internally as a Unicode sequence with a know codec. That means that they are a sequence of bytes where each character might be one, two, three or four bytes depending on which Unicode page this characters are from.

What is X in Python string?

The leading \x escape sequence means the next two characters are interpreted as hex digits for the character code, so \xaa equals chr(0xaa) , i.e., chr(16 * 10 + 10) -- a small raised lowercase 'a' character.


2 Answers

The internal representation will change in Python 3.3 which implements PEP 393. The new representation will pick one or several of ascii, latin-1, utf-8, utf-16, utf-32, generally trying to get a compact representation.

Implicit conversions into surrogate pairs will only be done when talking to legacy APIs (those only exist on windows, where wchar_t is two bytes); the Python string will be preserved. Here are the release notes.

like image 92
Tobu Avatar answered Sep 23 '22 19:09

Tobu


In Python 3.3 and above, the internal representation of the string will depend on the string, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393.

For previous Pythons, the internal representation depends on the build flags of Python. Python can be built with flag values --enable-unicode=ucs2 or --enable-unicode=ucs4. ucs2 builds do in fact use UTF-16 as their internal representation, and ucs4 builds use UCS-4 / UTF-32.

like image 23
Matthew Brett Avatar answered Sep 21 '22 19:09

Matthew Brett