
How is unicode represented internally in Python?

How is a Unicode string literally represented in Python's memory?

For example, I can visualize 'abc' as its equivalent ASCII bytes in memory, and an integer can be thought of as its two's complement representation. However, for u'\u2049', even though it is represented in UTF-8 as '\xe2\x81\x89' (3 bytes long), how do I visualize the literal u'\u2049' code point in memory?

Is there a specific way it is stored in memory? Do Python 2 and Python 3 treat it differently?

A few related questions for anyone curious:

1) How are these strings represented internally in Python interpreter? I don't understand

2) What is internal representation of string in Python 3.x

Nishant asked Sep 27 '14 21:09



1 Answer

I'm assuming you want to know about CPython, the standard implementation. Python 2 and Python 3.0-3.2 use either UCS2* or UCS4 for Unicode characters, meaning each character takes either 2 bytes or 4 bytes. Which one is used is a compile-time option of the interpreter.

\u2049 is then represented as \x49\x20 (UCS2, little-endian), \x20\x49 (UCS2, big-endian), \x49\x20\x00\x00 (UCS4, little-endian) or \x00\x00\x20\x49 (UCS4, big-endian), depending on the native byte order of your system and on whether UCS2 or UCS4 was picked. ASCII characters in a unicode string still use 2 or 4 bytes per character too.
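To see those four byte patterns without an old interpreter build, here is a minimal Python 3 sketch. It uses the utf-16 and utf-32 codecs as stand-ins for the UCS2 and UCS4 buffers; that substitution is an assumption made for illustration, and the variable names are invented for this example.

```python
# Illustration only: the codecs below stand in for the raw in-memory
# buffers of old narrow (UCS2) and wide (UCS4) CPython builds.
ch = "\u2049"

ucs2_little = ch.encode("utf-16-le")  # 2 bytes, low byte first
ucs2_big = ch.encode("utf-16-be")     # 2 bytes, high byte first
ucs4_little = ch.encode("utf-32-le")  # 4 bytes, low byte first
ucs4_big = ch.encode("utf-32-be")     # 4 bytes, high byte first

for label, raw in [
    ("UCS2 little-endian", ucs2_little),
    ("UCS2 big-endian", ucs2_big),
    ("UCS4 little-endian", ucs4_little),
    ("UCS4 big-endian", ucs4_big),
]:
    # Iterating over bytes yields integers in Python 3.
    print(label, " ".join("%02x" % b for b in raw))

# An ASCII character is padded to the same fixed width.
ascii_ucs2 = "a".encode("utf-16-le")
ascii_ucs4 = "a".encode("utf-32-le")
```

The last two values show the point about ASCII: 'a' takes 2 or 4 bytes in these layouts, with the extra bytes set to zero.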

Python 3.3 switched to a new internal representation, using the most compact form needed to represent all characters in a string. The width is chosen per string: 1 byte, 2 bytes or 4 bytes per character, determined by the highest code point the string contains. ASCII and Latin-1 text uses just 1 byte per character, text with other BMP characters uses 2 bytes per character, and text with any character beyond the BMP uses 4 bytes per character.
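One rough way to observe the per-string width on CPython 3.3 or later is to measure how much sys.getsizeof grows when a string gets one character longer. This is a sketch that relies on a CPython implementation detail; bytes_per_char is a helper name made up for this example, and other implementations may report different numbers.

```python
import sys


def bytes_per_char(ch, n=10):
    """Estimate the storage width of ch by growing a string of it by one."""
    # Repeating the same character keeps the string in the same
    # representation, so the size difference is one character's width.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


widths = {
    "ascii": bytes_per_char("a"),
    "latin1": bytes_per_char("\xe9"),
    "bmp": bytes_per_char("\u2049"),
    "non_bmp": bytes_per_char("\U0001F600"),
}
print(widths)

# A single non-BMP character widens every character in the string.
plain = "a" * 100
mixed = "a" * 99 + "\U0001F600"
print(sys.getsizeof(plain), sys.getsizeof(mixed))
```

The two strings at the end have the same length, but the second one is much larger because all 100 characters are stored at the width of the widest one.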

See PEP 393: Flexible String Representation for the full low-down on these representations.


* Technically speaking, the UCS-2 build uses UTF-16, as non-BMP characters are encoded with UTF-16 surrogates and take 4 bytes (2 UTF-16 code units) each. However, the Python documentation still refers to this as UCS2.

This leads to unexpected behaviour on such builds: len() of a unicode string containing non-BMP characters is larger than the number of characters it contains, because each surrogate pair counts as two.
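The surrogate effect can be reproduced on Python 3.3+ by counting UTF-16 code units by hand. This is a sketch that assumes the utf-16-le codec is a fair stand-in for what a narrow build stored; U+1F600 is just an arbitrary non-BMP example character.

```python
import struct

s = "\U0001F600"  # one character outside the BMP

# On Python 3.3+ len() counts code points.
code_points = len(s)

# A narrow build counted 16-bit code units instead.
raw = s.encode("utf-16-le")
utf16_units = len(raw) // 2

# The two code units form a surrogate pair.
high, low = struct.unpack("<2H", raw)
print(code_points, utf16_units, hex(high), hex(low))
```

The code unit count is what len() would have reported on a narrow build, which is why one visible character could appear to have length 2 there.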

Martijn Pieters answered Oct 02 '22 00:10