
Decoding Unicode text backwards

Many text encodings have the property that you can go through encoded text backwards and still be able to decode it. ASCII, UTF-8, UTF-16, and UTF-32 all have this property. This lets you do handy things like read the last line of a file without reading all the lines before it, or go backwards a few lines from your current position in a file.
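To illustrate why this works for UTF-8 in particular: every continuation byte has the bit pattern 10xxxxxx, so scanning backward from any position reaches the start of a character after skipping at most three bytes. The sketch below shows the idea; the helper name `last_char_start` is made up for this example, and it assumes the input is non-empty, valid UTF-8.

```python
def last_char_start(data: bytes) -> int:
    """Return the index where the last UTF-8 character in `data` begins.

    Assumes `data` is non-empty and valid UTF-8 (hypothetical helper).
    """
    i = len(data) - 1
    # Continuation bytes look like 0b10xxxxxx; step back over them
    # until we land on an ASCII byte or a lead byte.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


encoded = "ēgapā".encode("utf-8")
start = last_char_start(encoded)
print(encoded[start:].decode("utf-8"))
```

The same boundary test can be applied at any offset in a buffer, not only at the end, which is what makes seeking backward by bytes feasible.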

Unfortunately, Python doesn't seem to come with any way to decode a file backwards. You can't read backwards, or seek by character quantity in an encoded file. The decoders in the codecs module support incremental decoding forwards, but not backwards. There doesn't seem to be any "UTF-8-backwards" codec I could feed UTF-8 bytes to in reverse order.

I could probably implement the codec-dependent character boundary synchronization myself, read binary chunks backward, and feed properly-aligned chunks to appropriate decoders from the codecs module, but that sounds like the kind of thing where a non-expert would miss some subtle detail and not notice the output is wrong.
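As a rough illustration of the hand-rolled approach described above, limited to UTF-8 and not vetted against every edge case, one could read binary chunks from the end of the file and split on newline bytes. This is safe for UTF-8 because the byte 0x0A never appears inside a multi-byte sequence; it would not work as written for UTF-16 or UTF-32. The function name `read_last_line` and the `chunk_size` parameter are assumptions made for this sketch.

```python
import os


def read_last_line(path, chunk_size=4096):
    """Return the last non-empty line of a UTF-8 file, reading from the end.

    Hypothetical helper: only the tail of the file is read, in chunks.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Ignore trailing line endings so we find the newline that
            # precedes the last line, not the one that terminates it.
            cut = tail.rstrip(b"\r\n").rfind(b"\n")
            if cut != -1:
                # Everything after this newline starts on a character
                # boundary, so it can be decoded directly.
                tail = tail[cut + 1:]
                break
        return tail.decode("utf-8").rstrip("\r\n")
```

Because the bytes are only decoded after cutting at a newline, a chunk boundary that falls in the middle of a multi-byte character does no harm.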

Is there any simple way to decode text backward in Python with existing tools?


Several people appear to have missed the point that reading the entire file to do this defeats the purpose. While I'm clarifying things, I might as well add that this needs to work for variable-length encodings, too. UTF-8 support is a must.

user2357112 supports Monica asked Apr 12 '16 19:04



1 Answer

Absent a general-purpose solution, here is one specific to UTF-8 (written for Python 2):

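def rdecode(it):
    # `it` yields the bytes of a UTF-8 string in reverse order (Python 2 str).
    buffer = []
    for ch in it:
        och = ord(ch)
        if not (och & 0x80):
            # ASCII byte (0xxxxxxx): a complete character on its own.
            yield ch.decode('utf-8')
        elif not (och & 0x40):
            # Continuation byte (10xxxxxx): hold it until the lead byte arrives.
            buffer.append(ch)
        else:
            # Lead byte (11xxxxxx): restore forward order, then decode.
            buffer.append(ch)
            yield ''.join(reversed(buffer)).decode('utf-8')
            buffer = []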

utf8 = 'ho math\xc4\x93t\xc4\x93s hon \xc4\x93gap\xc4\x81 ho I\xc4\x93sous'
print utf8.decode('utf8')
for i in rdecode(reversed(utf8)):
    print i,
print ""

Result:

$ python x.py 
ho mathētēs hon ēgapā ho Iēsous
s u o s ē I   o h   ā p a g ē   n o h   s ē t ē h t a m   o h 
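The code above relies on Python 2 behavior (the print statement and str.decode). A minimal Python 3 sketch of the same idea, assuming the input is an iterable of integer byte values such as reversed(some_bytes), might look like this:

```python
def rdecode(byte_values):
    """Yield characters from UTF-8 byte values supplied in reverse order.

    Assumes the underlying data is valid UTF-8.
    """
    buffer = []
    for b in byte_values:
        if not b & 0x80:
            # ASCII byte (0xxxxxxx): a complete character on its own.
            yield chr(b)
        elif not b & 0x40:
            # Continuation byte (10xxxxxx): hold it until the lead byte arrives.
            buffer.append(b)
        else:
            # Lead byte (11xxxxxx): restore forward order, then decode.
            buffer.append(b)
            yield bytes(reversed(buffer)).decode("utf-8")
            buffer = []


utf8 = "ho mathētēs hon ēgapā ho Iēsous".encode("utf-8")
print("".join(rdecode(reversed(utf8))))
```

Like the original, this is a generator, so it can consume bytes produced lazily, for example from a file read in chunks from the end, without loading the whole file.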
Robᵩ answered Oct 19 '22 13:10
