Stream/string/bytearray transformations in Python 3

Question

Python 3 cleans up Python's handling of Unicode strings. Judging by the Python 3 documentation compared with the Python 2 documentation, the codecs in Python 3 have become more restrictive, presumably as part of this effort.

For example, codecs that conceptually convert a bytestream to a different form of bytestream have been removed:

base64_codec
bz2_codec
hex_codec

And codecs that conceptually convert Unicode to a different form of Unicode have also been removed (in Python 2 it actually went between Unicode and bytestream, but conceptually it's really Unicode to Unicode I reckon):

rot_13

My main question is, what is the "right way" in Python 3 to do what these removed codecs used to do? They're not codecs in the strict sense but rather "transformations", although the interface and implementation would be very similar to those of codecs.

I don't care about rot_13, but I'm interested to know what would be the "best way" to implement a transformation of line ending styles (Unix line endings vs Windows line endings). This should really be a Unicode-to-Unicode transformation done before encoding to a byte stream, especially when UTF-16 is being used, as discussed in this other SO question.
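To illustrate why the order matters, here is a minimal sketch (not part of the original question) that assumes the standard io.TextIOWrapper with its newline parameter is an acceptable way to do the translation. The sample text and variable names are made up for illustration.

```python
import io

text = "one\ntwo\n"

# Text-level translation: newline="\r\n" rewrites "\n" in the str data
# before the UTF-16 encoder ever sees it.
raw = io.BytesIO()
writer = io.TextIOWrapper(raw, encoding="utf-16-le", newline="\r\n")
writer.write(text)
writer.flush()
translated = raw.getvalue()

# Byte-level replacement after encoding: in UTF-16-LE "\n" is b"\n\x00",
# so inserting a single b"\r" byte misaligns every following code unit.
naive = text.encode("utf-16-le").replace(b"\n", b"\r\n")

expected = "one\r\ntwo\r\n".encode("utf-16-le")
print(translated == expected)  # text-level result matches
print(naive == expected)       # byte-level result does not
```

The same newline argument is accepted by the built-in open(), so for ordinary files no custom codec should be needed for this particular transformation.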

Craig McQueen · Accepted Answer

It looks as though all these non-codec modules are being handled on a case-by-case basis. Here's what I've found so far:

base64 is now available via the base64 module
bz2 can now be done using the bz2 module
hex string encoding/decoding can be done with the hexlify and unhexlify functions of the binascii module (a bit of a hidden feature)
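As a rough sketch of what these three replacements look like in use (the sample data is arbitrary; the functions are from the standard library modules named above):

```python
import base64
import binascii
import bz2

data = b"hello world"

# base64: bytes in, bytes out
b64 = base64.b64encode(data)
b64_back = base64.b64decode(b64)

# bz2: one-shot compression and decompression of a bytes object
packed = bz2.compress(data)
unpacked = bz2.decompress(packed)

# hex: hexlify returns the hex digits as a bytes object, not a str
hexed = binascii.hexlify(data)
unhexed = binascii.unhexlify(hexed)

print(b64, hexed)
```

Note that all three operate on bytes objects, so text has to be encoded (for example with str.encode('utf-8')) before being passed in.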

I guess that means there's no standard framework for creating such string/bytearray transformation modules, but they're being done on a case-by-case basis in Python 3.

Update for Python 3.2

A comment on a blog post "Compressing text using Python’s unicode support" alerted me to the fact that these codecs are back for Python 3.2.

Quoting the comment:

Since these are “text-to-text” or “binary-to-binary” transforms, though, the encode()/decode() methods in Python 3.x don’t support this style of usage – it’s a Python 2.x only feature.

The codecs themselves are back in 3.2, but you need to go through the codecs module API in order to use them – they aren’t available via the object method shorthand.

Look in the Python 3 docs for codecs — Binary Transforms.
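A minimal sketch of going through the codecs module API for the binary transforms, assuming Python 3.2 or later. It uses the full codec names with the _codec suffix; as far as I know, the shorter aliases were only restored in later releases.

```python
import codecs

payload = b"hello world"

# Each binary transform maps bytes to bytes, so codecs.encode/codecs.decode
# are used instead of the bytes.decode() method shorthand.
round_trips = {}
for name in ("hex_codec", "base64_codec", "bz2_codec"):
    encoded = codecs.encode(payload, name)
    round_trips[name] = codecs.decode(encoded, name)

hex_bytes = codecs.encode(payload, "hex_codec")
print(hex_bytes)
```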

From a blog post by Barry Warsaw:

Did you know that Python 2 provides some codecs for doing interesting conversions such as Caesar rotation (i.e. rot13)? Thus, you can do things like:
>>> 'foo'.encode('rot-13')
'sbb'
This doesn't work in Python 3 though, because even though certain str-to-str codecs like rot-13 still exist, the str.encode() interface requires that the codec return a bytes object. In order to use str-to-str codecs in both Python 2 and Python 3, you'll have to pop the hood and use a lower-level API, getting and calling the codec directly:
>>> from codecs import getencoder
>>> encoder = getencoder('rot-13')
>>> rot13string = encoder(mystring)[0]
You have to get the zeroth-element from the return value of the encoder because of the codecs API. A bit ugly, but it works in both versions of Python.
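For Python 3 only, the module-level codecs.encode function appears to be a slightly tidier route than getencoder, since it returns the transformed string directly rather than a tuple. A small sketch, using the rot_13 codec name and a made-up sample string:

```python
import codecs
from codecs import getencoder

mystring = "foo"  # sample input, chosen to match the example above

# Lower-level route from the quoted post: the encoder returns a
# (result, length_consumed) tuple, hence the [0].
encoder = getencoder("rot_13")
via_encoder = encoder(mystring)[0]

# Module-level helper: returns just the transformed str.
via_module = codecs.encode(mystring, "rot_13")

# rot13 is its own inverse, so applying it twice restores the input.
restored = codecs.encode(via_module, "rot_13")

print(via_encoder, via_module, restored)
```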
