I am trying to reduce the memory consumption of a Python dict, which in my case serves as a word --> document_id "inverted index". Each word is hashed as an integer, which takes up 24 bytes.

I was wondering if I can convert each element within the dict's values, and each key within the dict, to a bitarray instead. I've noticed that the max value of any encountered int is less than 2^22, so I can maybe just allocate a bit-array of "size 22". How can this be done? So far I've seen the gmpy2 and bitarray libraries, as well as std::bitset in the C++ stdlib, which I can use with Cython. I've read from this post that bitarray is not as fast as gmpy. In gmpy, I am not sure how to set the size. Finally, I wonder if the memory overhead of gmpy or bitarray objects in Python is worth it, when I can just use std::bitset, which probably uses the least memory of all.
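For concreteness, here is roughly what the index looks like (the names and values below are made up):

import sys

# Toy inverted index: hashed word (int) --> list of document ids (ints).
# In my data, every key and every id is below 2**22.
inverted_index = {
    1842193: [17, 42, 1048575],
    955411: [42, 2096128],
}

print(sys.getsizeof(1842193))  # 24 on my machine; the exact figure depends on the Python build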
>>> import sys
>>> sys.getsizeof(1)
24
That's 24 bytes, just for an integer, on my machine. For Python objects, object overhead is going to swamp all other costs for such small values. Even if you could shave off 10 bits (you can't; memory allocation doesn't work that way), that saving is peanuts compared to the refcount, the type pointer, and the dict entry itself (which isn't included in that 24-byte figure), even before you get to the actual data.
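A rough way to see where the memory actually goes (the exact numbers vary with Python version and platform, so treat these as illustrative):

import sys

print(sys.getsizeof(1))        # ~24 bytes: refcount + type pointer + the value itself
print(sys.getsizeof(2 ** 21))  # same size: a 22-bit payload costs no less than a 1-bit one

# The dict's own per-entry cost comes on top of that, and it is not
# included in getsizeof() of the keys and values.
d = {}
empty = sys.getsizeof(d)
for i in range(1000):
    d[i] = i
print((sys.getsizeof(d) - empty) / 1000.0)  # roughly tens of bytes per slot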
bitarray isn't going to help; it's probably bigger than an int. I don't know about std::bitset, since I'm not sure what overhead Cython adds to it, but it's going to have overhead for things like the bit count. If anything, I'd expect a Cython int to work best, but it probably needs to be converted to a regular Python int to go in a dict.
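If you have bitarray installed, a quick sanity check along these lines (just getsizeof, not a benchmark) should show the problem:

import sys
from bitarray import bitarray  # assumes the bitarray package is available

b = bitarray(22)  # 22 bits of payload
b.setall(0)

# The wrapper object alone is already in the same ballpark as the int it
# would replace, and depending on the bitarray version getsizeof() may not
# even count the separately allocated bit buffer.
print(sys.getsizeof(b))
print(sys.getsizeof(1))  # the int you were hoping to shrink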