I have hundreds of thousands of NumPy boolean arrays that I would like to use as keys to a dictionary. (The values of this dictionary are the number of times we've observed each of these arrays.) Since NumPy arrays are not hashable and can't be used as keys themselves, I would like to serialize these arrays as efficiently as possible.
We have two definitions for efficiency to address, here:
I'm looking to strike a good balance between these two competing interests; however, efficient memory usage is more important to me, and I'm willing to sacrifice computing time.
There are two properties that I hope will make this task easier:
1s and 0s, a bit sequence
Is there an efficient Python (2.7, or, if possible, 2.6) data structure that I could serialize these to (perhaps some sort of bytes structure), and could you provide an example of the conversion between an array and this structure, and from the structure back to the original array?
Note that it is not necessary to store information about whether each index was True or False; a structure that simply stored indices where the array was True would be sufficient to reconstitute the array.
A sufficient solution would work for a 1-dimensional array, but a good solution would also work for a 2-dimensional array, and a great solution would work for arrays of even higher dimensions.
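To make the index-only idea above concrete, here is a minimal sketch. The helper names `indices_key` and `array_from_indices` are invented for this illustration, and it assumes the shape of every array is known separately (for example, because all arrays share one shape).

```python
import numpy as np

def indices_key(arr):
    """Return a hashable tuple of the flat indices where arr is True."""
    return tuple(np.flatnonzero(arr).tolist())

def array_from_indices(key, shape):
    """Rebuild the boolean array from the stored True indices."""
    flat = np.zeros(int(np.prod(shape)), dtype=bool)
    flat[np.array(key, dtype=np.intp)] = True
    return flat.reshape(shape)

a = np.arange(25).reshape(5, 5) > 10
key = indices_key(a)
restored = array_from_indices(key, a.shape)
```

A tuple of Python integers is likely to be compact only when the arrays are sparse; for dense arrays a packed-bit representation should be smaller.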
I would convert the array to a bitfield using np.packbits. This is fairly memory efficient, since it uses all the bits of each byte, and the code is still relatively simple.
import numpy as np
array = np.array([True, False] * 20)
Hash = np.packbits(array).tostring()
counts = {}
counts[Hash] = 10
# unpackbits pads to a whole number of bytes, so trim to the original length
print(np.unpackbits(np.fromstring(Hash, np.uint8)).astype(bool)[:len(array)])
Be careful with variable-length bool arrays: because packbits pads to a whole byte, the code does not distinguish between, for example, all-False arrays of 6 and 7 members. For multidimensional arrays you will need some reshaping.
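One possible way around both caveats is to store the shape next to the packed bits. The sketch below is an assumption-laden illustration, not part of the original answer: `pack_key` and `unpack_key` are hypothetical helper names, and it uses `tobytes`/`frombuffer`, the newer NumPy spellings of `tostring`/`fromstring`.

```python
import numpy as np

def pack_key(arr):
    """Hashable key: (shape, packed bits). The shape resolves the padding ambiguity."""
    arr = np.asarray(arr, dtype=bool)
    return (arr.shape, np.packbits(arr.ravel()).tobytes())

def unpack_key(key):
    """Rebuild the boolean array, dropping the padding bits added by packbits."""
    shape, packed = key
    n = int(np.prod(shape))
    bits = np.unpackbits(np.frombuffer(packed, dtype=np.uint8))[:n]
    return bits.astype(bool).reshape(shape)

six = np.zeros(6, dtype=bool)
seven = np.zeros(7, dtype=bool)
grid = np.arange(12).reshape(3, 4) % 2 == 0
```

Here `six` and `seven` pack to the same single byte, but their keys differ because the shapes differ, and `grid` round-trips with its two dimensions intact. The price is a small tuple per key.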
If this is still not efficient enough, and your arrays are large, you might be able to reduce the memory further by packing:
import bz2
Hash_compressed=bz2.compress(Hash,1)
This does not work for random, incompressible data, though.
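A quick way to check whether compression is worth it for your data is to compare sizes on a regular array and on a random one. This is only a sketch with made-up test arrays (`structured` and `noisy` are illustrative names); actual sizes depend on your data.

```python
import bz2
import numpy as np

# A highly regular array: one True every 100 elements.
structured = np.zeros(10000, dtype=bool)
structured[::100] = True

# A random array, which should be close to incompressible.
noisy = np.random.RandomState(0).rand(10000) > 0.5

def packed(arr):
    """One bit per element, as bytes."""
    return np.packbits(arr).tobytes()

def compressed(arr):
    """Packed bits run through bz2 at the lowest compression level."""
    return bz2.compress(packed(arr), 1)

structured_sizes = (len(packed(structured)), len(compressed(structured)))
noisy_sizes = (len(packed(noisy)), len(compressed(noisy)))
```

Comparing the two tuples shows how much bz2 saves on the regular array, while the random array gains little or nothing once the bz2 header overhead is included.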
I have three suggestions. My first is baldly stolen from aix. The problem is that bitarray objects are mutable, and their hashes are content-independent (i.e. for bitarray b, hash(b) == id(b)). This can be worked around, as aix's answer shows, but in fact you don't need bitarrays at all -- you can just use tostring!
In [1]: import numpy
In [2]: a = numpy.arange(25).reshape((5, 5))
In [3]: (a > 10).tostring()
Out[3]: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01'
Now we have an immutable string of bytes, perfectly suitable for use as a dictionary key. To be clear, each escape sequence such as \x01 stands for a single byte, not four characters, so this is as compact as you can get without bitstring-style serialization.
In [4]: len((a > 10).tostring())
Out[4]: 25
Converting back is easy and fast:
In [5]: numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
Out[5]:
array([[False, False, False, False, False],
[False, False, False, False, False],
[False, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
In [6]: %timeit numpy.fromstring((a > 10).tostring(), dtype=bool).reshape(5, 5)
100000 loops, best of 3: 5.75 us per loop
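Since the goal in the question is to count how often each array is observed, the byte-string keys can be fed straight into a counter. The following is a small sketch with invented sample data; it assumes every array has the same known shape and uses `tobytes`/`frombuffer`, the newer names for `tostring`/`fromstring`.

```python
from collections import Counter
import numpy as np

shape = (5, 5)
observations = [
    np.arange(25).reshape(shape) > 10,
    np.arange(25).reshape(shape) > 10,   # same content as the first array
    np.arange(25).reshape(shape) > 20,
]

# One byte per element; equal arrays produce equal keys.
counts = Counter(arr.tobytes() for arr in observations)

# Recover the most frequently seen array from its key.
most_common_key, most_common_count = counts.most_common(1)[0]
recovered = np.frombuffer(most_common_key, dtype=bool).reshape(shape)
```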
Like aix, I was unable to figure out how to store dimension information in a simple way. If you must have that, then you may have to put up with longer keys. cPickle seems like a good choice though. Still, its output is 10x as big...
In [7]: import cPickle
In [8]: len(cPickle.dumps(a > 10))
Out[8]: 255
It's also slower:
In [9]: cPickle.loads(cPickle.dumps(a > 10))
Out[9]:
array([[False, False, False, False, False],
[False, False, False, False, False],
[False, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
In [10]: %timeit cPickle.loads(cPickle.dumps(a > 10))
10000 loops, best of 3: 45.8 us per loop
My third suggestion uses bitstrings -- specifically, bitstring.ConstBitArray. It's similar in spirit to aix's solution, but ConstBitArrays are immutable, so they do what you want, hash-wise.
In [11]: import bitstring
You have to flatten the numpy array explicitly:
In [12]: b = bitstring.ConstBitArray((a > 10).flat)
In [13]: b.bin
Out[13]: '0b0000000000011111111111111'
It's immutable so it hashes well:
In [14]: hash(b)
Out[14]: 12144
It's super-easy to convert back into an array, but again, shape information is lost.
In [15]: numpy.array(b).reshape(5, 5)
Out[15]:
array([[False, False, False, False, False],
[False, False, False, False, False],
[False, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]], dtype=bool)
It's also the slowest option by far:
In [16]: %timeit numpy.array(b).reshape(5, 5)
1000 loops, best of 3: 240 us per loop
Here's some more information. I kept fiddling around and testing things and came up with the following. First, bitarray is way faster than bitstring when you use it right:
In [1]: %timeit numpy.array(bitstring.ConstBitArray(a.flat)).reshape(5, 5)
1000 loops, best of 3: 283 us per loop
In [2]: %timeit numpy.array(bitarray.bitarray(a.flat)).reshape(5, 5)
10000 loops, best of 3: 19.9 us per loop
Second, as you can see from the above, all the tostring shenanigans are unnecessary; you could also just explicitly flatten the numpy array. But actually, aix's method is faster, so that's what the now-revised numbers below are based on.
So here's a full rundown of the results. First, definitions:
small_nda = numpy.arange(25).reshape(5, 5) > 10
big_nda = numpy.arange(10000).reshape(100, 100) > 5000
small_barray = bitarray.bitarray(small_nda.flat)
big_barray = bitarray.bitarray(big_nda.flat)
small_bstr = bitstring.ConstBitArray(small_nda.flat)
big_bstr = bitstring.ConstBitArray(big_nda.flat)
keysize is the result of sys.getsizeof({small|big}_nda.tostring()), sys.getsizeof({small|big}_barray) + sys.getsizeof({small|big}_barray.tostring()), or sys.getsizeof({small|big}_bstr) + sys.getsizeof({small|big}_bstr.tobytes()) -- both the latter methods return bitstrings packed into bytes, so they should be good estimates of the space taken by each.
speed is the time it takes to convert from {small|big}_nda to a key and back, plus the time it takes to convert a bitarray object into a string for hashing, which is either a one-time cost if you cache the string or a cost per dict operation if you don't cache it.
         small_nda   big_nda   small_barray   big_barray   small_bstr   big_bstr
keysize  64          10040     148            1394         100          1346
speed    2.05 us     3.15 us   3.81 us        96.3 us      277 us       92.2 ms
                               + 161 ns       + 257 ns
As you can see, bitarray is impressively fast, and aix's suggestion of a subclass of bitarray should work well. Certainly it's a lot faster than bitstring. Glad to see that you accepted that answer.
On the other hand, I still feel attached to the numpy.array.tostring() method. The keys it generates are, asymptotically, 8x as large, but the speedup you get remains substantial -- about 30x on my machine for large arrays. It's a good tradeoff. Still, it's probably not enough to bother with until it becomes the bottleneck.
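The 8x figure follows from one byte per element versus one bit per element. A short check of the payload sizes only (ignoring Python's per-object overhead, which is what sys.getsizeof adds in the table above), with np.packbits standing in for the bit-packed formats:

```python
import numpy as np

big = np.arange(10000).reshape(100, 100) > 5000

raw_len = len(big.tobytes())                          # one byte per element
packed_len = len(np.packbits(big.ravel()).tobytes())  # one bit per element
ratio = raw_len / float(packed_len)
```

For this 100x100 array that is 10000 bytes against 1250 bytes, a ratio of 8.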
Initially, I suggested using bitarray. However, as rightly pointed out by @senderle, since bitarray is mutable, it can't be used to directly key into a dict.
Here is a revised solution (still based on bitarray internally):
import bitarray
import numpy as np

class BoolArray(object):

    # create from an ndarray
    def __init__(self, array):
        ba = bitarray.bitarray()
        ba.pack(array.tostring())
        self.arr = ba.tostring()
        self.shape = array.shape
        self.size = array.size

    # convert back to an ndarray
    def to_array(self):
        ba = bitarray.bitarray()
        ba.fromstring(self.arr)
        ret = np.fromstring(ba.unpack(), dtype=bool)[:self.size]
        return ret.reshape(self.shape)

    # compare by packed content so that equal arrays map to the same dict entry
    def __cmp__(self, other):
        return cmp(self.arr, other.arr)

    def __hash__(self):
        return hash(self.arr)

x = (np.random.random((2,3,2))>0.5)
b1 = BoolArray(x)
b2 = BoolArray(x)
d = {b1: 12}
d[b2] += 1
print d
print b1.to_array()
This works with Python 2.5+, requires one bit per array element and supports arrays of any shape/dimensions.
EDIT: In recent versions of bitarray, you have to replace ba.tostring and ba.fromstring with ba.tobytes and ba.frombytes (the old names have been deprecated since version 0.4.0).
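For readers on Python 3, where __cmp__ is no longer used, the same wrapper idea can be sketched with __eq__ and __hash__. Note that this is not the class above: it swaps bitarray for np.packbits, and the name `PackedBoolKey` is invented for the example. Treat it as a starting point under those assumptions rather than a drop-in replacement.

```python
import numpy as np

class PackedBoolKey(object):
    """Hashable wrapper storing one bit per element plus the shape.

    Instances are meant to be treated as immutable once created.
    """

    def __init__(self, array):
        array = np.asarray(array, dtype=bool)
        self.shape = array.shape
        self.data = np.packbits(array.ravel()).tobytes()

    def to_array(self):
        """Convert back to an ndarray, dropping the padding bits."""
        n = int(np.prod(self.shape))
        bits = np.unpackbits(np.frombuffer(self.data, dtype=np.uint8))[:n]
        return bits.astype(bool).reshape(self.shape)

    def __eq__(self, other):
        if not isinstance(other, PackedBoolKey):
            return NotImplemented
        return self.shape == other.shape and self.data == other.data

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash((self.shape, self.data))

x = np.random.RandomState(0).rand(2, 3, 2) > 0.5
d = {PackedBoolKey(x): 12}
d[PackedBoolKey(x)] += 1   # a second wrapper of the same array hits the same entry
```

Unlike the bitarray version, this sketch also compares shapes, so two arrays with identical bits but different shapes count as different keys.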