
Which encoding is used for strings in Python 2.x?

What is the default encoding for strings in Python 2.x? I've read that there are two possible ways to declare a string:

string = 'this is a string'
unicode_string = u'this is a unicode string'

The second string is in Unicode. What is the encoding of the first string?

asked Apr 20 '18 by Cortex


2 Answers

As per Python default/implicit string encodings and conversions (restating its Python 2 part concisely, to avoid duplication):

There are actually multiple independent "default" string encodings in Python 2, used by different parts of its functionality.

  • Parsing the code and string literals (see the first sketch after this list):

    • str from a literal -- will contain the raw bytes from the file; no transcoding is done
    • unicode from a literal -- the bytes from the file are decode'd with the file's "source encoding", which defaults to ascii
    • with the unicode_literals future import, all literals in the file are treated as Unicode literals
  • Transcoding/type conversion:

    • str<->unicode type conversion and encode/decode without arguments are done with sys.getdefaultencoding()
      • which is almost always ascii, so any national characters will cause a UnicodeError
    • str can only be decode'd and unicode -- encode'd. Trying it the other way around involves an implicit type conversion (with the aforementioned result)
  • I/O, including printing (see the second sketch below):

    • unicode -- encode'd with <file>.encoding if set, otherwise implicitly converted to str (with the aforementioned result)
    • str -- the raw bytes are written to the stream; no transcoding is done. For national characters, a terminal will show different glyphs depending on its locale settings.
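
To make the literal-parsing and conversion rules concrete, here is the first sketch: a minimal Python 2 example, assuming a UTF-8 source file and the stock ascii default encoding (the café value is mine, not from the answer):

# -*- coding: utf-8 -*-   # the file's "source encoding"
import sys

print sys.getdefaultencoding()   # ascii -- on a stock CPython 2

s = 'café'     # str: holds the raw UTF-8 bytes from this file, no transcoding
u = u'café'    # unicode: the file's bytes decode'd with the source encoding

print len(s)   # 5 -- five bytes; é takes two bytes in UTF-8
print len(u)   # 4 -- four code points

print s.decode('utf-8') == u     # True -- an explicit decode round-trips

try:
    s.decode()                   # no argument: uses sys.getdefaultencoding()
except UnicodeDecodeError as e:
    print e                      # 'ascii' codec can't decode byte 0xc3 ...

try:
    s.encode('utf-8')            # encode on a str first decode's it with ascii
except UnicodeDecodeError as e:
    print e                      # the implicit type conversion fails the same way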
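
And the second sketch, for the I/O rules (the file names are hypothetical; codecs.open is one way to obtain a stream that carries an encoding):

import codecs

u = u'caf\xe9'                 # u'café'

f = codecs.open('out.txt', 'w', encoding='utf-8')
f.write(u)                     # unicode is encode'd with the stream's encoding
f.close()

f = open('raw.txt', 'wb')      # a plain file object, no encoding attached
f.write(u.encode('utf-8'))     # str: the raw bytes go to the stream untouched
try:
    f.write(u)                 # unicode on a plain stream: implicit str conversion
except UnicodeEncodeError as e:
    print e                    # 'ascii' codec can't encode character u'\xe9' ...
f.close()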
answered Oct 20 '22 by ivan_pozdeev


The literal answer is that they do not necessarily represent any particular encoding. In Python 2, a string is just an array of bytes, exactly like the bytes type in Python 3. For a string s you can call s.decode() to get a Unicode string, but you must* pass the encoding manually for exactly that reason. You could use a string to hold ASCII bytes, or characters from Windows code page 850 (which is a superset of ASCII), or UTF-8 bytes, or even UTF-16 bytes. The last case is interesting because even if the characters in that string are in the ASCII range, the bytes do not match the ASCII-encoded version (they alternate with null bytes). The string type is even suitable for bytes of some binary format that does not correspond to any encoded string, e.g. the bytes of an image file.
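
To illustrate the UTF-16 case, here is a quick Python 2 sketch (the values are my own):

s = u'abc'.encode('utf-16-le')   # UTF-16 bytes held in a plain str
print repr(s)                    # 'a\x00b\x00c\x00' -- ASCII bytes alternating with nulls
print s == 'abc'                 # False, even though the characters are all ASCII
print s.decode('utf-16-le')      # abc -- works only because the encoding is passed explicitly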

A more practical answer is that often ASCII is assumed. For example, the literal string "xyz" will give a three-byte string with the bytes corresponding to the ASCII encoding of those characters.
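
For example (a tiny sketch of my own):

s = 'xyz'
print [ord(b) for b in s]           # [120, 121, 122] -- the ASCII codes of x, y, z
print s.decode('ascii') == u'xyz'   # True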

This ambiguity is the reason for the change in behaviours and conventions around strings in Python 3.

* As noted in CristiFati's answer, it is possible to omit the encoding= argument to decode, in which case ASCII will be assumed. My mistake.

answered Oct 20 '22 by Arthur Tacca