Decoding Unicode text backwards

Question

Many text encodings have the property that you can go through encoded text backwards and still be able to decode it. ASCII, UTF-8, UTF-16, and UTF-32 all have this property. This lets you do handy things like read the last line of a file without reading all the lines before it, or go backwards a few lines from your current position in a file.
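
To make the UTF-8 case concrete, here is a minimal Python 3 sketch of why a backward scan can resynchronize: continuation bytes always match the bit pattern 10xxxxxx, so any other byte starts a character. The helper names `is_utf8_char_start` and `last_char_start` are made up for illustration, and the sketch assumes non-empty, well-formed UTF-8 input.

```python
def is_utf8_char_start(byte):
    """Return True if this byte value (0-255) begins a UTF-8 character."""
    # Continuation bytes look like 10xxxxxx; ASCII bytes and lead bytes do not.
    return (byte & 0xC0) != 0x80


def last_char_start(data):
    """Index of the first byte of the last character in UTF-8 `data`.

    Assumes `data` is non-empty and well-formed.
    """
    i = len(data) - 1
    # Step left over continuation bytes until a character start is found.
    while i > 0 and not is_utf8_char_start(data[i]):
        i -= 1
    return i


data = "hon ēgapā".encode("utf-8")
start = last_char_start(data)
print(data[start:].decode("utf-8"))  # ā
```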

Unfortunately, Python doesn't seem to come with any way to decode a file backwards. You can't read a text file backwards, or seek by a number of characters in an encoded file. The decoders in the codecs module support incremental decoding forwards, but not backwards. There doesn't seem to be any "UTF-8-backwards" codec I could feed UTF-8 bytes to in reverse order.
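
For reference, this is roughly what the forward-only incremental decoding looks like in Python 3, using `codecs.getincrementaldecoder`. It is only a small illustration: the decoder buffers an incomplete sequence until the remaining bytes arrive, which works only when bytes are fed in their original order.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
data = "ēgapā".encode("utf-8")

pieces = []
# Feed one byte at a time; an incomplete sequence yields an empty string
# until its final byte has been supplied.
for i in range(len(data)):
    pieces.append(decoder.decode(data[i:i + 1]))
# Signal the end of input so a truncated sequence would raise an error.
pieces.append(decoder.decode(b"", final=True))

text = "".join(pieces)
print(text)
```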

I could probably implement the codec-dependent character boundary synchronization myself, read binary chunks backward, and feed properly-aligned chunks to appropriate decoders from the codecs module, but that sounds like the kind of thing where a non-expert would miss some subtle detail and not notice the output is wrong.
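
A rough Python 3 sketch of that approach, for UTF-8 only, might look like the following. The function name `decode_chunks_backward` and the `chunk_size` parameter are made up for illustration. It assumes a seekable binary file containing well-formed UTF-8, and it is exactly the kind of hand-rolled code that could hide a subtle mistake, so treat it as a starting point rather than a vetted implementation.

```python
import io


def decode_chunks_backward(f, chunk_size=4096):
    """Yield decoded pieces of a UTF-8 binary file, last piece first.

    Each yielded string is in normal reading order; only the order of the
    pieces is reversed. Assumes `f` is seekable and holds valid UTF-8.
    """
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    tail = b""  # continuation bytes whose lead byte lies in an earlier chunk
    while pos > 0:
        size = min(chunk_size, pos)
        pos -= size
        f.seek(pos)
        chunk = f.read(size) + tail
        start = 0
        if pos > 0:
            # Skip leading continuation bytes (10xxxxxx); they complete a
            # character that starts before this chunk.
            while start < len(chunk) and (chunk[start] & 0xC0) == 0x80:
                start += 1
        tail = chunk[:start]
        if start < len(chunk):
            yield chunk[start:].decode("utf-8")


text = "ho mathētēs hon ēgapā ho Iēsous"
pieces = list(decode_chunks_backward(io.BytesIO(text.encode("utf-8")), chunk_size=5))
print("".join(reversed(pieces)))
```

Only the bytes from the end of the file back to the point of interest are read, so finding the last line would mean consuming pieces until one contains a newline.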

Is there any simple way to decode text backward in Python with existing tools?

Several people appear to have missed the point that reading the entire file to do this defeats the purpose. While I'm clarifying things, I might as well add that this needs to work for variable-length encodings, too. UTF-8 support is a must.

Robᵩ · Accepted Answer

Absent a general-purpose solution, here is one specific to UTF-8 (Python 2 syntax). It takes the bytes in reverse order and yields one decoded character at a time:

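def rdecode(it):
    # `it` yields the bytes of a UTF-8 string in reverse order.
    buffer = []
    for ch in it:
        och = ord(ch)
        if not (och & 0x80):
            # 0xxxxxxx: a single-byte (ASCII) character.
            yield ch.decode('utf-8')
        elif not (och & 0x40):
            # 10xxxxxx: a continuation byte; hold it until the lead byte arrives.
            buffer.append(ch)
        else:
            # 11xxxxxx: the lead byte; restore forward order and decode.
            buffer.append(ch)
            yield ''.join(reversed(buffer)).decode('utf-8')
            buffer = []

# A UTF-8 encoded byte string (a Python 2 str).
utf8 = 'ho math\xc4\x93t\xc4\x93s hon \xc4\x93gap\xc4\x81 ho I\xc4\x93sous'
print utf8.decode('utf8')
for i in rdecode(reversed(utf8)):
    print i,
print ""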

Result:

$ python x.py 
ho mathētēs hon ēgapā ho Iēsous
s u o s ē I   o h   ā p a g ē   n o h   s ē t ē h t a m   o h
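
The code above relies on Python 2 byte strings. A possible Python 3 adaptation is sketched below; the name `rdecode_py3` is made up for illustration, and like the original it assumes well-formed UTF-8 and does no error handling.

```python
def rdecode_py3(reversed_bytes):
    """Yield characters from UTF-8 bytes supplied in reverse order."""
    buffer = bytearray()
    for byte in reversed_bytes:  # iterating over bytes yields ints in Python 3
        if byte < 0x80:
            # 0xxxxxxx: a single-byte (ASCII) character.
            yield chr(byte)
        elif (byte & 0x40) == 0:
            # 10xxxxxx: continuation byte; keep it until the lead byte shows up.
            buffer.append(byte)
        else:
            # 11xxxxxx: lead byte; put the bytes back in forward order and decode.
            buffer.append(byte)
            yield bytes(reversed(buffer)).decode("utf-8")
            buffer.clear()


text = "ho mathētēs hon ēgapā ho Iēsous"
result = "".join(rdecode_py3(reversed(text.encode("utf-8"))))
print(result)
```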

Tags: python, text, encoding, unicode

user2357112 supports Monica
