 

What if I put two kinds of encoded strings, say utf-8 and utf-16, in one file?

In Python, for example:

f = open('test','w')
f.write('this is a test\n'.encode('utf-16'))
f.write('another test\n'.encode('utf-8'))
f.close()

That file gets messy when I re-open it:

f = open("test")
print f.readline().decode('utf-16')  # it leads to UnicodeDecodeError
print f.readline().decode('utf-8')   # it works fine

However, if I keep the text encoded in one style (say UTF-16 only), it reads back fine. So I'm guessing that mixing two encodings in the same file is wrong and can't be decoded back, even if I know the encoding of each specific string? Any suggestion is welcome, thank you!

asked Nov 24 '25 by nrek


2 Answers

This is usually a bad idea, but in your case it doesn't work because you encode newlines as well.

In UTF-16, every character is encoded as two bytes (characters outside the Basic Multilingual Plane take two such pairs), including the newline you wrote. Because you read your file line by line, Python gives you everything up to and including the next byte that happens to be '\n'; but that byte is only half of the two-byte UTF-16 newline, so the line you get back is cut in the middle of a code unit and is no longer a complete UTF-16 byte stream.

To understand this, you need to understand UTF-16 encoding in a bit more detail. When writing a 16-bit value as two 8-bit bytes, the computer has to decide which byte to write to the file first. That decision can go either way and is called endianness; like Gulliver's Lilliputians, computer systems prefer either big-endian or little-endian ordering.

A UTF-16 data stream is thus written in one of two byte orders, and a Byte Order Mark or "BOM" is written first to flag which one was chosen.

Your newline is thus encoded either as '\n\x00' (little-endian) or '\x00\n' (big-endian), and on reading, that null byte (\x00) ends up either as part of the UTF-16 data you decode, or as part of the UTF-8 data (where it simply decodes to a NUL character instead of raising an error). So if you encode the UTF-16 text as big-endian, things happen to work (but you have a stray null byte), but if you encode it as little-endian, things break.
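To make this concrete, here is what the question's first write produces, and what readline() then returns, on a little-endian machine (a Python 2 session reading back the file written by the question's code; the exact bytes depend on your platform's byte order, so treat it as an illustration):

>>> 'this is a test\n'.encode('utf-16')
'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00\n\x00'
>>> open('test', 'rb').readline()  # stops at the first '\n' byte; the trailing '\x00' is left behind
'\xff\xfet\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00\n'

The returned line has an odd number of bytes, which is exactly what makes the UTF-16 decode fail.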

Basically, encoded data should be treated strictly as binary data and you should use a different method to delineate different pieces of encoded text, or you should only use encodings where newlines are strictly encoded as newlines.

I'd use a length prefix, read that first, then read that number of bytes from the file for each encoded piece of data.

>>> import struct
>>> f = open('test', 'wb')
>>> entry1 = 'this is a test\n'.encode('utf-16')
>>> f.write(struct.pack('!h', len(entry1)))
>>> f.write(entry1)
>>> entry2 = 'another test\n'.encode('utf-8')
>>> f.write(struct.pack('!h', len(entry2)))
>>> f.write(entry2)
>>> f.close()

I've used the struct module to write the length as a fixed-size field. Note that I open the file in binary mode, too.
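For reference, the '!h' format packs the length as a big-endian ("network order") signed 16-bit integer, so the length field itself always occupies exactly two bytes:

>>> struct.pack('!h', 15)
'\x00\x0f'
>>> struct.calcsize('!h')
2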

Reading:

>>> f = open('test', 'rb')
>>> fieldsize = struct.calcsize('!h')
>>> length = struct.unpack('!h', f.read(fieldsize))[0]
>>> print f.read(length).decode('utf-16')
this is a test

>>> length = struct.unpack('!h', f.read(fieldsize))[0]
>>> print f.read(length).decode('utf-8')
another test

>>>

Again the file is opened in binary mode.

In a real-life application you probably have to include the encoding information per entry as well.
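One way that could look, sticking with Python 2 and the struct approach (a sketch only; the record layout and the write_entry/read_entry helpers are my own invention, not an existing API), is to store a length-prefixed encoding name in front of each length-prefixed payload:

import struct

def write_entry(f, text, encoding):
    # Hypothetical record layout:
    #   [1 byte: length of encoding name][encoding name]
    #   [2 bytes: length of payload][payload bytes]
    data = text.encode(encoding)
    f.write(struct.pack('!B', len(encoding)) + encoding)
    f.write(struct.pack('!h', len(data)) + data)

def read_entry(f):
    # Read the encoding name first, then the payload it applies to.
    name_len = struct.unpack('!B', f.read(1))[0]
    encoding = f.read(name_len)
    data_len = struct.unpack('!h', f.read(struct.calcsize('!h')))[0]
    return f.read(data_len).decode(encoding)

f = open('test', 'wb')
write_entry(f, u'this is a test\n', 'utf-16')
write_entry(f, u'another test\n', 'utf-8')
f.close()

f = open('test', 'rb')
print read_entry(f)
print read_entry(f)
f.close()

Each entry now records which codec to use when it is read back, so the reader no longer has to know the order of encodings in advance.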

answered Nov 27 '25 by Martijn Pieters


A working version of your code. Basically, don't encode the newlines, and strip them off after calling readline():

f = open('test','w')
f.write('this is a test'.encode('utf-16'))
f.write("\n")
f.write('another test'.encode('utf-8'))
f.write("\n")
f.close()

f = open("test")
print f.readline().strip("\n").decode('utf-16')
print f.readline().strip("\n").decode('utf-8')

answered Nov 27 '25 by AlbertFerras


