
How do I properly create custom text codecs?

I'm digging through some old binaries that contain (among other things) text. Their text frequently uses custom character encodings for Reasons, and I want to be able to read and rewrite them.

It seems to me that the appropriate way to do this is to create a custom codec using the standard codecs library. Unfortunately its documentation is both colossal and entirely bereft of examples. Google turns up a few, but only for Python 2, and I'm using Python 3.

I'm looking for a minimal example of how to use the codecs library to implement a custom character encoding.

Andrew asked Aug 04 '16 21:08



2 Answers

While the online documentation is certainly sparse, you can get a lot more information by looking at the source code. The docstrings and comments are quite clear, and the definitions for the parent classes (Codec, IncrementalEncoder, etc.) are ready to be copy/pasted for a start to your codec (be sure to replace the object in each class definition with the name of the class you're inheriting from). It's also worth looking at the example I linked to in the comments for how to assemble/register it.
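For what it's worth, a skeleton along those lines might look like the sketch below. It follows the layout of the standard library's own single-byte codecs (for example encodings/cp1252.py). The codec name alpha26 and the table are made up for illustration, and the codecs.charmap_* helpers are undocumented, so treat this as a starting point and not a reference implementation.

```python
import codecs
import string

# Hypothetical 256-entry table: byte value -> character.
# 0x00 is NUL, 0x01-0x1A are 'a'-'z', and '\ufffe' marks "undefined".
DECODING_TABLE = '\x00' + string.ascii_lowercase + '\ufffe' * 229
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


class Codec(codecs.Codec):
    # stateless encoder/decoder; both return (output, length consumed)
    def encode(self, input, errors='strict'):
        return codecs.charmap_encode(input, errors, ENCODING_TABLE)

    def decode(self, input, errors='strict'):
        return codecs.charmap_decode(input, errors, DECODING_TABLE)


class IncrementalEncoder(codecs.IncrementalEncoder):
    # one byte per character, so no state has to be carried between calls
    def encode(self, input, final=False):
        return codecs.charmap_encode(input, self.errors, ENCODING_TABLE)[0]


class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        return codecs.charmap_decode(input, self.errors, DECODING_TABLE)[0]


class StreamWriter(Codec, codecs.StreamWriter):
    pass


class StreamReader(Codec, codecs.StreamReader):
    pass


def search(name):
    # decline every name except our own
    if name != 'alpha26':
        return None
    return codecs.CodecInfo(
        name='alpha26',
        encode=Codec().encode,
        decode=Codec().decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamreader=StreamReader,
        streamwriter=StreamWriter,
    )


codecs.register(search)
```

Once registered, 'abc'.encode('alpha26') and b'\x01\x02\x03'.decode('alpha26') go through the table, and the usual errors= handlers apply to bytes or characters that the table leaves undefined.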

I've been stuck at the same point as you for a while looking through this, so good luck! If I have time in a few days, I'll see about actually making that implementation and pasting/linking to it here.

krs013 answered Oct 08 '22 12:10


You asked for minimal!

  • Write an encode function and a decode function.
  • Write a "search function" that returns a CodecInfo object constructed from the above encoder and decoder.
  • Use codecs.register to register that search function.

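Here is an example that maps the lowercase letters a-z to the numbers 0-25, in order. Because the encoder reads the text one character at a time, only a-j (0-9) round-trip cleanly; a real codec needs an unambiguous mapping.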

import codecs
import string

from typing import Tuple

# prepare map from numbers to letters
_encode_table = {str(number): bytes(letter, 'ascii') for number, letter in enumerate(string.ascii_lowercase)}

# prepare inverse map
_decode_table = {ord(v): k for k, v in _encode_table.items()}


def custom_encode(text: str, errors: str = 'strict') -> Tuple[bytes, int]:
    # example encoder that converts digit characters to letter bytes
    # errors is accepted because the codec machinery may pass it; it is ignored here
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
    return b''.join(_encode_table[x] for x in text), len(text)


def custom_decode(binary: bytes, errors: str = 'strict') -> Tuple[str, int]:
    # example decoder that converts letter bytes to digit characters
    # see https://docs.python.org/3/library/codecs.html#codecs.Codec.decode
    return ''.join(_decode_table[x] for x in binary), len(binary)


def custom_search_function(encoding_name):
    # return None for other names so the remaining search functions still run
    if encoding_name.lower() != 'reasons':
        return None
    return codecs.CodecInfo(custom_encode, custom_decode, name='Reasons')


def main():

    # register the search function for the custom codec
    # note that lookups by name ('Reasons' below) go through it
    codecs.register(custom_search_function)

    binary = b'abcdefg'
    # decode letters to numbers
    text = codecs.decode(binary, encoding='Reasons')
    print(text)
    # encode numbers to letters
    binary2 = codecs.encode(text, encoding='Reasons')
    print(binary2)
    # encode(decode(...)) should be an identity function
    assert binary == binary2

if __name__ == '__main__':
    main()

Running this prints

$ python codec_example.py
0123456
b'abcdefg'

See https://docs.python.org/3/library/codecs.html#codec-objects for details on the Codec interface. In particular, the decode function

... decodes the object input and returns a tuple (output object, length consumed).

whereas the encode function

... encodes the object input and returns a tuple (output object, length consumed).
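To make the tuple concrete, here is a quick sketch that uses the built-in ascii codec instead of a custom one. The functions stored on the CodecInfo return the full tuple, while codecs.decode and codecs.encode unwrap it and hand back only the output object.

```python
import codecs

info = codecs.lookup('ascii')

# The raw codec functions return (output object, length consumed).
decoded, consumed_bytes = info.decode(b'abc')
encoded, consumed_chars = info.encode('abc')

# The convenience wrapper returns only the output object.
plain = codecs.decode(b'abc', 'ascii')
```

This is why the custom encode and decode functions have to return a tuple even though callers of codecs.decode never see the second element.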

Note that you should also worry about handling streams, incremental encoding/decoding, as well as error handling. For a more complete example, refer to the hexlify codec that @krs013 mentioned.
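As a rough sketch of what incremental decoding and error handling involve, consider a made-up two-byte encoding in which each character is a big-endian 16-bit index into a table. The format, the name 'pairs', and the helper decode_pairs are all hypothetical; only codecs.BufferedIncrementalDecoder comes from the standard library. Only the 'strict', 'replace' and 'ignore' handlers are covered, and any other handler name is treated as 'strict'.

```python
import codecs

# Hypothetical table for a made-up encoding: index -> character.
ALPHABET = 'abcdefghijklmnopqrstuvwxyz '


def decode_pairs(data, errors='strict', final=True):
    """Decode as many complete byte pairs as possible.

    Returns (text, bytes consumed). A trailing odd byte is left
    unconsumed unless final is true, in which case it is an error.
    """
    data = bytes(data)
    usable = len(data) - len(data) % 2
    chars = []
    for start in range(0, usable, 2):
        index = int.from_bytes(data[start:start + 2], 'big')
        if index < len(ALPHABET):
            chars.append(ALPHABET[index])
        elif errors == 'replace':
            chars.append('\ufffd')
        elif errors != 'ignore':
            raise UnicodeDecodeError('pairs', data, start, start + 2,
                                     'index outside table')
    consumed = usable
    if final and usable != len(data):
        if errors == 'replace':
            chars.append('\ufffd')
        elif errors != 'ignore':
            raise UnicodeDecodeError('pairs', data, usable, len(data),
                                     'truncated pair')
        consumed = len(data)
    return ''.join(chars), consumed


class PairsIncrementalDecoder(codecs.BufferedIncrementalDecoder):
    # The base class keeps unconsumed bytes between calls and prepends
    # them to the next chunk before calling _buffer_decode.
    def _buffer_decode(self, input, errors, final):
        return decode_pairs(input, errors, final)


decoder = PairsIncrementalDecoder()
pieces = [
    decoder.decode(b'\x00'),              # half a pair: nothing yet
    decoder.decode(b'\x07\x00'),          # completes 'h', buffers one byte
    decoder.decode(b'\x08', final=True),  # completes 'i'
]
text = ''.join(pieces)  # 'hi'
```

A stateless decoder would have choked on the first one-byte chunk, which is the situation you hit as soon as data arrives from a file or socket in arbitrary pieces.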


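P.S. Instead of codecs.decode, you can also use codecs.open(..., encoding='Reasons'), but only once the CodecInfo also supplies a StreamReader and StreamWriter; the minimal example above does not.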

James Lim answered Oct 08 '22 14:10