I am reading some source code that downloads a zip file and reads the data into a NumPy array. The code is supposed to work on macOS and Linux, and here is the snippet that I see:
def _read32(bytestream):
    dt = numpy.dtype(numpy.uint32).newbyteorder('>')
    return numpy.frombuffer(bytestream.read(4), dtype=dt)
This function is used in the following context:
with gzip.open(filename) as bytestream:
    magic = _read32(bytestream)
It is not hard to see what happens here, but I am puzzled by the purpose of newbyteorder('>'). I read the documentation and know what endianness means, but I cannot understand why exactly the developer added newbyteorder (in my opinion it is not really needed).
The benefit of little-endianness is that a variable can be read at different widths from the same address, because the least significant byte always comes first. One benefit of big-endian is that you can read 16-bit and 32-bit values the way most humans write numbers: from left to right.
Another reason both orders exist is that byte order simply wasn't standardized back in the 1960s and 1970s; some companies (such as Intel with their x86 architecture) went with little-endian (possibly for the optimization reason above), whereas other companies selected big-endian.
So endianness comes into the picture when you are sending and receiving data across a network from one host to another. If the sender and the receiver have different endianness, the bytes need to be swapped so that the data is interpreted correctly.
Little-endian and big-endian are two ways of storing multibyte data types (int, float, etc.). On little-endian machines, the last byte of the binary representation of a multibyte value is stored first. On big-endian machines, the first byte is stored first.
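For instance, packing the same 32-bit value under each convention makes the storage order visible (a small sketch using Python's standard struct module):

```python
import struct

value = 0x12345678

# Big-endian: most significant byte first, read left to right.
print(struct.pack('>I', value).hex())  # 12345678

# Little-endian: least significant byte first.
print(struct.pack('<I', value).hex())  # 78563412
```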
That's because the downloaded data is in big-endian format, as described on the source page: http://yann.lecun.com/exdb/mnist/
All the integers in the files are stored in the MSB first (high endian) format used by most non-Intel processors. Users of Intel processors and other low-endian machines must flip the bytes of the header.
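Putting the two together, the header of such a file can be read with the _read32 helper from the question. The in-memory gzip buffer and the item count below are stand-ins for the real download, just so the sketch runs without a network fetch:

```python
import gzip
import io

import numpy

def _read32(bytestream):
    # Interpret 4 bytes as one big-endian unsigned 32-bit integer.
    dt = numpy.dtype(numpy.uint32).newbyteorder('>')
    return numpy.frombuffer(bytestream.read(4), dtype=dt)

# Stand-in for a downloaded MNIST file: the image-file magic number (2051)
# followed by an item count, both stored MSB-first as the format specifies.
payload = (2051).to_bytes(4, 'big') + (60000).to_bytes(4, 'big')
buf = io.BytesIO(gzip.compress(payload))

with gzip.open(buf) as bytestream:
    magic = _read32(bytestream)
    count = _read32(bytestream)

print(int(magic[0]), int(count[0]))  # 2051 60000
```

The same _read32 call works identically on a little-endian or big-endian host, because the byte order is pinned in the dtype rather than inherited from the machine.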
It is just a way of ensuring that the bytes in the resulting array are interpreted in the correct order, regardless of a system's native byte order.
By default, the built-in NumPy integer dtypes use the byte order that is native to your system. For example, my system is little-endian, so simply using the dtype numpy.dtype(numpy.uint32) would mean that values read into an array from a buffer whose bytes are in big-endian order will not be interpreted correctly.
If np.frombuffer is meant to receive bytes that are known to be in a particular byte order, the best practice is to modify the dtype using newbyteorder. This is mentioned in the documentation for np.frombuffer:
Notes
If the buffer has data that is not in machine byte-order, this should be specified as part of the data-type, e.g.:
>>> dt = np.dtype(int)
>>> dt = dt.newbyteorder('>')
>>> np.frombuffer(buf, dtype=dt)
The data of the resulting array will not be byteswapped, but will be interpreted correctly.
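A short sketch of what that note means in practice, using the MNIST image-file magic number (2051) as the sample bytes:

```python
import numpy as np

raw = b'\x00\x00\x08\x03'  # 2051 stored MSB-first, as in the MNIST files

# Explicit big-endian dtype: the bytes are interpreted correctly.
dt = np.dtype(np.uint32).newbyteorder('>')
print(int(np.frombuffer(raw, dtype=dt)[0]))  # 2051

# With an explicit little-endian dtype (what the native dtype resolves to on
# most desktop machines), the same bytes come out as a very different number.
dt_le = np.dtype(np.uint32).newbyteorder('<')
print(int(np.frombuffer(raw, dtype=dt_le)[0]))  # 50855936
```

Note that neither call copies or swaps any bytes; only the interpretation of the buffer changes.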