
Packing a boolean array needs to go through int (numpy 1.8.2)

I'm looking for the most compact way to store booleans. numpy internally needs 8 bits to store one boolean, but np.packbits allows packing them, which is pretty cool.

The problem is that to pack a 32e6-byte boolean array into a 4e6-byte array, we first need to spend 256e6 bytes converting the boolean array to an int array!

In [1]: db_bool = np.array(np.random.randint(2, size=(int(2e6), 16)), dtype=bool)
In [2]: db_int = np.asarray(db_bool, dtype=int)
In [3]: db_packed = np.packbits(db_int, axis=0)
In [4]: db_bool.nbytes, db_int.nbytes, db_packed.nbytes
Out[4]: (32000000, 256000000, 4000000)
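
For reference, the three sizes follow directly from the item sizes. This is only a back-of-the-envelope sketch: it assumes the native int is 64 bits (written explicitly as np.int64 here) and that packing runs along axis=0, as in the session above.

```python
import numpy as np

n_rows, n_cols = int(2e6), 16
n_values = n_rows * n_cols

# One byte per boolean element.
bool_bytes = n_values * np.dtype(bool).itemsize
# Eight bytes per element once converted to a 64-bit int (assumed native int).
int64_bytes = n_values * np.dtype(np.int64).itemsize
# Packing along axis=0 turns every 8 rows into 1 row of bytes.
packed_bytes = (n_rows // 8) * n_cols

print(bool_bytes, int64_bytes, packed_bytes)
# 32000000 256000000 4000000
```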

There has been an issue open in the numpy tracker about this for a year (cf. https://github.com/numpy/numpy/issues/5377).

Does anyone have a solution or a better workaround?

Here is the traceback when we try to pack the boolean array directly:

In [28]: db_pb = np.packbits(db_bool)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-3715e167166b> in <module>()
----> 1 db_pb = np.packbits(db_bool)
TypeError: Expected an input array of integer data type

PS: I will give bitarray a try, but I would have preferred a pure numpy solution.

user3313834 asked Jan 07 '23 22:01


2 Answers

There's no need to convert your boolean array to the native int dtype (which is typically 64-bit on x86_64). You can avoid copying your boolean array by viewing it as np.uint8, which also uses a single byte per element:

packed = np.packbits(db_bool.view(np.uint8))

unpacked = np.unpackbits(packed)[:db_bool.size].reshape(db_bool.shape).view(bool)

print(np.all(db_bool == unpacked))
# True

Also, np.packbits accepts boolean arrays directly in numpy v1.10.0 and newer; the change was committed over a year ago.
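
As a rough sketch of the view-based round trip, the snippet below wraps both directions in two helpers and checks that the uint8 view shares memory with the boolean array, so no int copy is made. The helper names pack_bool and unpack_bool are made up for illustration and are not part of numpy; the array shape is chosen so that the bit count is not a multiple of 8, which exercises the trimming of padding bits.

```python
import numpy as np

def pack_bool(arr):
    """Pack a boolean array into bits via a uint8 view (hypothetical helper)."""
    arr = np.asarray(arr, dtype=bool)       # no copy if arr is already bool
    return np.packbits(arr.view(np.uint8))  # flattens, pads the last byte with zeros

def unpack_bool(packed, shape):
    """Inverse of pack_bool; the original shape must be supplied by the caller."""
    size = int(np.prod(shape))
    # Drop the padding bits, restore the shape, reinterpret the 0/1 bytes as bool.
    return np.unpackbits(packed)[:size].reshape(shape).view(bool)

db_bool = np.random.RandomState(0).randint(2, size=(999, 13)).astype(bool)
as_bytes = db_bool.view(np.uint8)

packed = pack_bool(db_bool)
restored = unpack_bool(packed, db_bool.shape)

print(np.shares_memory(db_bool, as_bytes))  # the view does not copy the data
print(np.array_equal(db_bool, restored))
```

The packed array does not record the original shape or length, so those have to be stored alongside it.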

ali_m answered Jan 13 '23 11:01


Just yesterday, I answered a newcomer's question on how to deal with bits in Python, as compared to C++. After warning that there would be no speed gains, I sketched a naive "bitarray" that internally uses Python's bytearray objects.

This is in no way fast, but if you are no longer operating on your array bits and just want the output, it may be good enough, since you have full control over the conversion in Python code. Otherwise, you can try hinting the static types and running the same code as Cython, in which case you will probably want to use an np array with dtype=uint8 instead of a bytearray:

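class BitArray(object):
    def __init__(self, length):
        # One byte per 8 bits, rounded up.
        self.values = bytearray(b"\x00" * (length // 8 + (1 if length % 8 else 0)))
        self.length = length

    def __setitem__(self, index, value):
        if not 0 <= index < self.length:
            raise IndexError("BitArray index out of range")
        value = int(bool(value)) << (7 - index % 8)
        # Clear only the target bit (most significant bit first), then set it.
        mask = 0xff ^ (1 << (7 - index % 8))
        self.values[index // 8] &= mask
        self.values[index // 8] |= value

    def __getitem__(self, index):
        # Raising IndexError also makes iteration stop at `length`
        # instead of running on to the end of the last byte.
        if not 0 <= index < self.length:
            raise IndexError("BitArray index out of range")
        mask = 1 << (7 - index % 8)
        return bool(self.values[index // 8] & mask)

    def __len__(self):
        return self.length

    def __repr__(self):
        return "<{}>".format(", ".join("{:d}".format(value) for value in self))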

This code was originally posted here: Is there a builtin bitset in Python that's similar to the std::bitset from C++?
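
A numpy-backed variant of the same idea could look roughly like the sketch below. The function names make_bits, set_bit and get_bit are hypothetical, and the buffer is assumed to be an unsigned 8-bit array so that the 0xff-style masks fit. The bits are stored most significant bit first, which matches the default layout of np.packbits, so the result can be compared against it directly.

```python
import numpy as np

def make_bits(length):
    # One uint8 per 8 bits, rounded up (hypothetical helper).
    return np.zeros((length + 7) // 8, dtype=np.uint8)

def set_bit(buf, index, value):
    shift = 7 - index % 8          # most significant bit first
    if value:
        buf[index // 8] |= np.uint8(1 << shift)
    else:
        buf[index // 8] &= np.uint8(0xFF ^ (1 << shift))

def get_bit(buf, index):
    return bool(int(buf[index // 8]) & (1 << (7 - index % 8)))

bools = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
buf = make_bits(len(bools))
for i, b in enumerate(bools):
    set_bit(buf, i, b)

# Same bytes as numpy's own packing of the same values.
reference = np.packbits(np.array(bools, dtype=np.uint8))
print(np.array_equal(buf, reference))
```

Setting bits one at a time from Python is still slow; the point is only that the layout is interchangeable with np.packbits and np.unpackbits.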

jsbueno answered Jan 13 '23 10:01