I need a memory-efficient int-int dict in Python that would support the following operations in O(log n) time:
d[k] = v # replace if present
v = d[k] # None or a negative number if not present
I need to hold ~250M pairs, so it really has to be tight.
Do you happen to know a suitable implementation (Python 2.7)?
EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!
To rephrase. Here's a trivial int-int dictionary with 1M pairs:
>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
...     d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
...
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 25165960 51 25165960 51 dict (no owner)
1 1999521 100 23994252 49 49160212 100 int
On average, a pair of integers uses 49 bytes.
Here's an array of 2M integers:
>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
...     a.append(random.randint(0, sys.maxint))
...
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 7 8000028 100 8000028 100 array.array
On average, a pair of integers uses 8 bytes.
I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of an int-int dictionary that uses considerably less than 49 bytes/pair?
The Python dictionary implementation itself consumes a surprisingly small amount of memory. It is the many int (and, in general-purpose dicts, string) objects that hurt: each one carries a reference count and a type pointer on top of its value, which comes to at least 12 bytes per int on a 32-bit machine and 24 bytes on a 64-bit machine. The dictionary itself starts with 8 slots and is resized whenever it becomes about two-thirds full, growing roughly fourfold for small dicts and twofold for very large ones.
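As a quick illustration (figures from a 64-bit CPython 2.7 build; your exact numbers may differ, and a 32-bit build roughly halves them):
>>> import sys
>>> sys.getsizeof(12345)   # one int object: reference count + type pointer + value
24
>>> sys.getsizeof({})      # an empty dict already reserves its 8 initial slots
280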
As far as I have understood, the Python dict is quite a fast implementation of a hash table. Is there anything better than that for my specific case? If you're using Python 3.5 or lower, the dictionary built into Python 3.6 is said to be 20-25% faster than the old dictionary built into Python 3.5.
You could use the IIBTree from Zope (available standalone as the BTrees package).
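A minimal sketch of how that might look, assuming the BTrees package is installed; note that IIBTree stores 32-bit C integers for both keys and values, so for 64-bit numbers you would want its LLBTree sibling:
>>> from BTrees.IIBTree import IIBTree
>>> d = IIBTree()
>>> d[42] = 7        # insert or replace, O(log n)
>>> d.get(42, -1)    # lookup with a negative sentinel for missing keys
7
>>> d.get(13, -1)
-1
Because the tree buckets hold keys and values unboxed in C arrays, the per-pair overhead should be well below the ~49 bytes measured above, though not all the way down to 8.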
I don't know if this is a one-shot solution or part of an ongoing project, but if it's the former, is throwing more RAM at it cheaper than the developer time needed to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at about 15 GB, which would fit easily enough into most desktop boxes.
I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with them to tell you exactly where to look.
http://docs.scipy.org/doc/numpy/reference/
You might also find some useful ideas in this thread: Memory Efficient Alternatives to Python Dictionaries
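One concrete NumPy approach (a sketch, not a drop-in dict replacement): keep the keys in one sorted array and the values in a parallel array, and binary-search on lookup with numpy.searchsorted. Lookups are O(log n), but inserting a new key into a sorted array is O(n), so this fits a build-once / query-many workload. The pairs below are made up for illustration.
import numpy as np

# Hypothetical pairs; in practice you would bulk-load your ~250M pairs once.
pairs = sorted([(17, 3), (42, 7), (99, 1)])
keys = np.array([k for k, _ in pairs], dtype=np.int32)  # 4 bytes per key
vals = np.array([v for _, v in pairs], dtype=np.int32)  # 4 bytes per value -> 8 bytes/pair
# Use dtype=np.int64 (16 bytes/pair) if keys or values can exceed 2**31 - 1.

def get(k):
    # Binary search: O(log n).
    i = np.searchsorted(keys, k)
    if i < len(keys) and keys[i] == k:
        return int(vals[i])
    return -1  # negative sentinel for "not present"

print get(42)   # 7
print get(13)   # -1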
8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous then either you'd waste a lot of space between the keys by using an array representation (as well as needing some sort of dead value to indicate a null key), or you'd need to maintain a separate index to key/value pairs which by definition would exceed your 8 bytes per pair (even if only by a small amount).
I suggest you go with your array method, but the best approach will depend, I expect, on the nature of the keys.
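If pulling in NumPy is undesirable, here is a stdlib-only sketch of the same sorted-parallel-arrays idea, using array.array and bisect. Replacing an existing value is O(log n); inserting a brand-new key still costs an O(n) shift, so it again suits data that is mostly bulk-loaded.
import array
import bisect

keys = array.array('l')   # sorted keys ('l' is a native C long)
vals = array.array('l')   # values, kept parallel to keys

def put(k, v):
    # Insert or replace; binary search plus an O(n) shift for new keys.
    i = bisect.bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        vals[i] = v
    else:
        keys.insert(i, k)
        vals.insert(i, v)

def get(k):
    # O(log n) lookup; returns -1 for missing keys.
    i = bisect.bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        return vals[i]
    return -1

put(42, 7)
put(17, 3)
print get(42)   # 7
print get(13)   # -1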