I'm trying to design a system to react to different binary flags.
0 = Error
1 = Okay
2 = Logging
3 = Number
The sequence of this data represents a unique ID to reference the work, the flag and the number. Everything works, except the number flag. This is what I get...
>>> import struct
>>> data = (1234, 3, 12345678)
>>> bin = struct.pack('QHL', *data)
>>> print(bin)
b'\xd2\x04\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00Na\xbc\x00\x00\x00\x00\x00'
>>> result = struct.unpack_from('QH', bin, 0)
>>> print(result)
(1234, 3)
>>> offset = struct.calcsize('QH')
>>> result += struct.unpack_from('L', bin, offset)
>>> print(result)
(1234, 3, 7011541669862440960)
A long should be plenty big to represent the number 12345678, but why is it incorrectly unpacked?
Edit:
When I try to pack them separately, it looks like struct is adding too many null bytes between the flag and the long.
>>> import struct
>>> struct.pack('QH', 1234, 3)
b'\xd2\x04\x00\x00\x00\x00\x00\x00\x03\x00'
>>> struct.pack('L', 12345678)
b'Na\xbc\x00\x00\x00\x00\x00'
I can reproduce this error by adding padding before the long.
>>> struct.unpack('L', struct.pack('L', 12345678))
(12345678,)
>>> struct.unpack('xL', struct.pack('xL', 12345678))
(12345678,)
>>> struct.pack('xL', 12345678)
b'\x00\x00\x00\x00\x00\x00\x00\x00Na\xbc\x00\x00\x00\x00\x00'
Potential fix?
When I use little-endian order, the problem seems to correct itself and make the binary string shorter. Since this is destined for a SSL wrapped TCP socket, that's a win win, right? Keeping bandwidth low is generally good, yes?
>>> import struct
>>> data = (1234, 3, 12345678)
>>> bin = struct.pack('<QHL', *data)
>>> print(bin)
b'\xd2\x04\x00\x00\x00\x00\x00\x00\x03\x00Na\xbc\x00'
>>> result = struct.unpack_from('<QH', bin, 0)
>>> print(result)
(1234, 3)
>>> offset = struct.calcsize('<QH')
>>> result += struct.unpack_from('<L', bin, offset)
>>> print(result)
(1234, 3, 12345678)
Why does this happen? I am perplexed.
You are running into byte alignment issues. By default, the individual parts of a struct are not simply placed next to each other; they are aligned in memory the way a C compiler would align them. This makes access more efficient, especially for other applications, as they have a more direct way to read individual fields without having to account for overlap.
You can easily see this by using struct.calcsize to check the space required to encode a given format:
>>> struct.calcsize('QHL')
16
>>> struct.calcsize('QH')
10
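As you can see, QHL requires 16 bytes, but QH requires 10. The L we left off is, however, only 4 bytes wide. So there is some padding going on to make sure that the L starts again on “a fresh block”. This is because each type must start at an offset that is a multiple of its own size, and padding is inserted to guarantee that. These numbers come from a platform where a native L is 4 bytes; in the output from the question the L takes 8 bytes, so it starts at offset 16 and the total is 24. For QH it looks like this: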
QQ QQ | QQ QQ | HH
Once you use QHL, you get the following:
QQ QQ | QQ QQ | HH 00 | LL LL
As you can see, there were two padding bytes added to make sure that L starts on a new block of four.
You can modify the alignment (as well as the endianness) using a special character at the beginning of the format string. In your case, you could use =QHL to disable alignment altogether:
QQ QQ | QQ QQ | HH LL | LL
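To check how much padding your own platform inserts, a small sketch like the following compares the native layout with the `=` layout. The variable names are only illustrative, and the native numbers are not hard-coded because they depend on the platform's C types.

```python
import struct

# Native mode ('@', the default): C sizes and alignment, so results vary by platform.
native_total = struct.calcsize('QHL')
native_parts = struct.calcsize('QH') + struct.calcsize('L')
padding = native_total - native_parts  # bytes inserted before the L

# Standard mode ('='): fixed sizes (Q=8, H=2, L=4) and no padding.
packed_total = struct.calcsize('=QHL')

print(native_total, native_parts, padding)
print(packed_total)
```

Where a native L is 4 bytes this reports two padding bytes, matching the diagram above; with an 8-byte L, as in the question's output, it reports six.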
When I use little-endian order, the problem seems to correct itself and make the binary string shorter. Since this is destined for a SSL wrapped TCP socket, that's a win win, right? Keeping bandwidth low is generally good, yes?
Using an explicit byte order also disables alignment, yes, so that’s where the effect comes from. Whether it’s a good idea to turn off alignment depends, though. If you want to consume your data somewhere else, in other programs, it would be a good idea to stick to native alignment.
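For the socket case from the question, one possible approach is to fix the layout explicitly so that both ends agree regardless of platform. This is only a sketch: the `MESSAGE`, `encode` and `decode` names are hypothetical, and the choice of `!` (network byte order) over `<` is an assumption, not something the question requires.

```python
import struct

# Hypothetical message layout: work ID (Q), flag (H), number (L).
# '!' selects network (big-endian) byte order with standard sizes and no padding.
MESSAGE = struct.Struct('!QHL')

def encode(work_id, flag, number):
    """Pack one message into its fixed-size wire form."""
    return MESSAGE.pack(work_id, flag, number)

def decode(payload):
    """Unpack all three fields in one call, so no manual offsets are needed."""
    return MESSAGE.unpack(payload)

payload = encode(1234, 3, 12345678)
print(len(payload))      # 14
print(decode(payload))   # (1234, 3, 12345678)
```

Unpacking the whole record with the same format that packed it also sidesteps the offset arithmetic that caused the original problem.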
The correct output in your case:
>>> import struct
>>> data = (1234, 3, 12345678)
>>> bin = struct.pack('QHL', *data)
>>> print(bin)
b'\xd2\x04\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00Na\xbc\x00\x00\x00\x00\x00'
>>> result = struct.unpack_from('QH', bin, 0)
>>> print(result)
(1234, 3)
>>> result += struct.unpack_from('L', bin, 16)
>>> print(result)
(1234, 3, 12345678)
This happens because:
Padding is only automatically added between successive structure members.
Also, the reason your fix works is:
No padding is added when using non-native size and alignment, e.g. with ‘<’, ‘>’, ‘=’, and ‘!’.
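If you want to keep native alignment and still unpack in two steps, the offset does not have to be hard-coded. A sketch, assuming the same QHL layout as in the question: a trailing zero-count code such as `0L` makes struct pad the size up to the alignment of that type, so `calcsize('QH0L')` yields the position where the L begins on the current platform.

```python
import struct

record = struct.pack('QHL', 1234, 3, 12345678)

# 'QH0L' means: Q, H, then zero L items. The zero-count L adds no data but
# pads the size up to the alignment of L, giving the offset where L starts.
offset = struct.calcsize('QH0L')

header = struct.unpack_from('QH', record, 0)
number = struct.unpack_from('L', record, offset)
result = header + number
print(offset, result)
```

On the platform from the question the offset comes out as 16, the same value used in the corrected transcript above.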