NumPy seems to lack built-in support for 3-byte and 6-byte integer types, a.k.a. uint24 and uint48.
I have a large data set using these types and want to feed it to NumPy. What I currently do (for uint24):
import numpy as np
dt = np.dtype([('head', '<u2'), ('data', '<u2', (3,))])
# I would like to be able to write
# dt = np.dtype([('head', '<u2'), ('data', '<u3', (2,))])
# dt = np.dtype([('head', '<u2'), ('data', '<u6')])
a = np.memmap("filename", mode='r', dtype=dt)
# convert 3 x 2-byte words into 2 x 3-byte values
# w1 is the least significant word, w3 the most significant
w1, w2, w3 = a['data'].swapaxes(0, 1)
a2 = np.empty((2, a.size), dtype='u4')
# 3 least-significant bytes: ((w2 & 0xFF) << 16) | w1
a2[0] = w2 % 256
a2[0] <<= 16
a2[0] += w1
# 3 most-significant bytes: (w3 << 8) | (w2 >> 8)
a2[1] = w3
a2[1] <<= 8
a2[1] += w2 >> 8
# now a2 contains the "uint24" values
While this works for 100 MB of input, it looks inefficient (think of hundreds of GBs of data). Is there a more efficient way? For example, a special kind of read-only view that masks out part of the data would be useful (a sort of "uint64 with the two most significant bytes always zero" type). I only need read-only access to the data.
I don't believe there's a way to do what you're asking (it would require unaligned access, which is highly inefficient on some architectures). My solution from Reading and storing arbitrary byte length integers from a file might be more efficient at transferring the data to an in-process array:
# view the file as raw big-endian bytes
a = np.memmap("filename", mode='r', dtype=np.dtype('>u1'))
# one zeroed big-endian u8 per 6-byte record (note // for integer division)
e = np.zeros(a.size // 6, np.dtype('>u8'))
# copy each 2-byte word into the low 6 bytes of its u8; the high word stays 0
for i in range(3):
    e.view(dtype='>u2')[i + 1::4] = a.view(dtype='>u2')[i::3]
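Note that the snippet above assumes big-endian records, while the data in the question is little-endian ('<u2' words). A minimal little-endian variant (my assumption about the file layout) fills the three low words of each u8 and leaves the top word zero:
a = np.memmap("filename", mode='r', dtype=np.dtype('u1'))
e = np.zeros(a.size // 6, np.dtype('<u8'))
for i in range(3):
    # word i of each 6-byte record becomes bits 16*i .. 16*i+15 of the u8
    e.view(dtype='<u2')[i::4] = a.view(dtype='<u2')[i::3]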
You can get unaligned access using the strides constructor parameter (here a is the raw byte memmap from above, serving as the buffer):
e = np.ndarray((a.size - 2) // 6, np.dtype('<u8'), buffer=a, strides=(6,))
However, with this each element overlaps the next one by two bytes, so to actually use it you'd have to mask out the high bytes on access.
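For example (my addition, assuming the little-endian '<u8' view above, where the top two bytes of each element actually belong to the following record):
# clear the two overlapping most significant bytes on each access
vals = e & 0xFFFFFFFFFFFF            # all uint48 values at once
part = e[10:20] & 0xFFFFFFFFFFFF     # or mask just the slice you need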
There's an answer for this over at: How do I create a Numpy dtype that includes 24 bit integers? It's a bit ugly, but does exactly what you want: it lets you index your ndarray as if it had a dtype of '<u3', so you can memmap() big data from disk.
You still need to manually apply a bitmask to clear the fourth, overlapping byte, but that can be applied to the sliced (multidimensional) array after access. The trick is to abuse the strides of the ndarray so that indexing works; a small workaround is needed so that NumPy doesn't complain about the last element reading past the end of the buffer.
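For concreteness, here is a minimal sketch of that strided-view trick for little-endian uint24 (my reconstruction, not the linked answer verbatim; it sidesteps the bounds complaint by simply dropping the last element, whose 4-byte read would run past the end of the file):
import numpy as np

raw = np.memmap("filename", mode='r', dtype=np.uint8)
n = raw.size // 3  # number of 3-byte records
# overlapping 4-byte little-endian reads, one starting every 3 bytes
overlapped = np.ndarray((n - 1,), dtype='<u4', buffer=raw, strides=(3,))
values = overlapped & 0xFFFFFF  # clear the overlapping fourth byte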