Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Memory efficient int-int dict in Python

I need a memory efficient int-int dict in Python that would support the following operations in O(log n) time:

d[k] = v  # replace if present
v = d[k]  # None or a negative number if not present

I need to hold ~250M pairs, so it really has to be tight.

Do you happen to know a suitable implementation (Python 2.7)?

EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!


To rephrase. Here's a trivial int-int dictionary with 1M pairs:

>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
...     d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
... 
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 25165960  51  25165960  51 dict (no owner)
     1 1999521 100 23994252  49  49160212 100 int

On average, a pair of integers uses 49 bytes.

Here's an array of 2M integers:

>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
...     a.append(random.randint(0, sys.maxint))
... 
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   7  8000028 100   8000028 100 array.array

On average, a pair of integers uses 8 bytes.

I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of int-int dictionary that uses considerably less than 49 bytes/pair?

like image 350
Bolo Avatar asked Oct 26 '10 11:10

Bolo


People also ask

Are Python dictionaries memory efficient?

The Python dictionary implementation consumes a surprisingly small amount of memory. But the space taken by the many int and (in particular) string objects, for reference counts, pre-calculated hash codes etc., is more than you'd think at first.

How much memory does a dict take Python?

This sums up to at least 12 bytes on a 32bit machine and 24 bytes on a 64bit machine. The dictionary starts with 8 empty buckets. This is then resized by doubling the number of entries whenever its capacity is reached.

How are Python dictionaries stored in memory?

The dictionary is stored in memory in the form of a hash table with an ordered array of ranges and their corresponding values. The dictionary key has the UInt64 type. This storage method works the same way as hashed and allows using date/time (arbitrary numeric type) ranges in addition to the key.

How fast is dict Python?

As far I have understood, python dict is a quite fast implementation of a hashtable. Is there anything better than that for my specific case? If your using Python 3.5 or lower, then the dictionary built in in Python 3.6 is said to be 20-25% faster than the old dictionary builtin in Python 3.5.


3 Answers

You could use the IIBtree from Zope

like image 189
John La Rooy Avatar answered Sep 28 '22 05:09

John La Rooy


I don't know if this is a one-shot solution, or part of an ongoing project, but if it's the former, is throwing more ram at it cheaper than the necessary developer time to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at 15GB, which would fit easily enough into most desktop boxes.

I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with the library to tell you exactly where to look.

http://docs.scipy.org/doc/numpy/reference/

You might also find some useful ideas in this thread: Memory Efficient Alternatives to Python Dictionaries

like image 28
Paul McMillan Avatar answered Sep 28 '22 05:09

Paul McMillan


8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous then either you'd waste a lot of space between the keys by using an array representation (as well as needing some sort of dead value to indicate a null key), or you'd need to maintain a separate index to key/value pairs which by definition would exceed your 8 bytes per pair (even if only by a small amount).

I suggest you go with your array method, but the best approach will depend on the nature of the keys I expect.

like image 41
Kylotan Avatar answered Sep 28 '22 06:09

Kylotan