NumPy: 3-byte, 6-byte types (aka uint24, uint48)

NumPy seems to lack built-in support for 3-byte and 6-byte types, aka uint24 and uint48. I have a large data set using these types and want to feed it to numpy. What I currently do (for uint24):

import numpy as np
dt = np.dtype([('head', '<u2'), ('data', '<u2', (3,))])
# I would like to be able to write
#  dt = np.dtype([('head', '<u2'), ('data', '<u3', (2,))])
#  dt = np.dtype([('head', '<u2'), ('data', '<u6')])
a = np.memmap("filename", mode='r', dtype=dt)
# convert 3 x 2-byte data to 2 x 3-byte
# w1 is the LSB word, w3 the MSB word
w1, w2, w3 = a['data'].swapaxes(0, 1)
a2 = np.ndarray((2, a.size), dtype='u4')
# 3 LSBs (first uint24)
a2[0] = w2 % 256
a2[0] <<= 16
a2[0] += w1
# 3 MSBs (second uint24)
a2[1] = w3
a2[1] <<= 8
a2[1] += w2 >> 8
# now a2 contains the "uint24" matrix

While this works for a 100 MB input, it looks inefficient (think of hundreds of GB of data). Is there a more efficient way? For example, a special kind of read-only view that masks out part of each element would be useful (a sort of "uint64 with the two most significant bytes always zero" type). I only need read-only access to the data.

Ilia K. asked Aug 15 '12




2 Answers

I don't believe there's a way to do what you're asking (it would require unaligned access, which is highly inefficient on some architectures). My solution from Reading and storing arbitrary byte length integers from a file might be more efficient at transferring the data to an in-process array:

import numpy as np

# read the file as raw bytes, then copy each 6-byte value into the low
# 6 bytes of a zeroed 8-byte integer (big-endian layout assumed here)
a = np.memmap("filename", mode='r', dtype=np.dtype('>u1'))
e = np.zeros(a.size // 6, np.dtype('>u8'))
for i in range(3):
    e.view(dtype='>u2')[i + 1::4] = a.view(dtype='>u2')[i::3]

You can get unaligned access using the strides constructor parameter:

# `a` is the byte-level memmap above; the "- 2" keeps the last overlapping 8-byte read in bounds
e = np.ndarray((a.size - 2) // 6, np.dtype('<u8'), buffer=a, strides=(6,))

However, with this layout each element overlaps the next one, so to actually use it you'd have to mask out the two high bytes on access.
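For example, a minimal sketch, assuming the strided view e from above and little-endian data on disk as in the question (the slice size and the name chunk are just illustrative):

# the low 48 bits of each '<u8' element hold the wanted value; the high
# 16 bits belong to the start of the next record, so clear them after access
chunk = e[:1000] & 0xFFFFFFFFFFFF   # an ordinary (copied) uint64 array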

ecatmur answered Sep 21 '22


There's an answer for this over at: How do I create a Numpy dtype that includes 24 bit integers?

It's a bit ugly, but it does exactly what you want: it lets you index your ndarray as if it had a dtype of <u3, so you can memmap() big data from disk.
You still need to apply a bitmask manually to clear the fourth, overlapping byte, but that can be done on the sliced (multidimensional) array after access.

The trick is to abuse the strides of an ndarray so that indexing works; one extra step is needed so that NumPy doesn't complain about the view running past the end of the buffer. A sketch of the idea follows.
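As a rough sketch of that idea (not the linked answer's exact code; the file name and the assumption of tightly packed little-endian uint24 records are mine):

import numpy as np

# raw bytes of the packed uint24 stream
raw = np.memmap("filename", mode='r', dtype='u1')

# each element is a 4-byte little-endian read starting every 3 bytes, so the
# wanted 24 bits sit in the low 3 bytes and the 4th byte overlaps the next
# value; the last record is dropped so the final 4-byte read stays inside the
# buffer (that is the "limits" complaint the trick has to dodge)
n = (raw.size - 1) // 3
u24 = np.ndarray((n,), dtype='<u4', buffer=raw, strides=(3,))

# apply the bitmask after slicing to clear the overlapping high byte
values = u24[:10] & 0xFFFFFF

Because raw is mapped read-only, the strided view is read-only too; the masked result is an ordinary copied array.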

RGD2 answered Sep 24 '22