I need a memory-efficient int-int dict in Python that would support the following operations in O(log n) time:
d[k] = v # replace if present
v = d[k] # None or a negative number if not present
I need to hold ~250M pairs, so it really has to be tight.
Do you happen to know a suitable implementation (Python 2.7)?
EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!
To rephrase. Here's a trivial int-int dictionary with 1M pairs:
>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
...     d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
...
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 25165960 51 25165960 51 dict (no owner)
1 1999521 100 23994252 49 49160212 100 int
On average, a pair of integers uses 49 bytes.
Here's an array of 2M integers:
>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
...     a.append(random.randint(0, sys.maxint))
...
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 7 8000028 100 8000028 100 array.array
On average, a pair of integers uses 8 bytes.
I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of an int-int dictionary that uses considerably less than 49 bytes/pair?
The Python dictionary implementation itself consumes a surprisingly small amount of memory. It is the many int (and, in general-purpose dicts, string) objects that hurt: each one carries a reference count and a type pointer on top of its value, which comes to at least 12 bytes per int on a 32-bit machine and 24 bytes on a 64-bit machine. The dictionary itself starts with 8 slots and is resized whenever it becomes about two-thirds full, growing roughly fourfold for small dicts and twofold for very large ones.
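As a quick illustration (figures from a 64-bit CPython 2.7 build; your exact numbers may differ, and a 32-bit build roughly halves them):
>>> import sys
>>> sys.getsizeof(12345)   # one int object: reference count + type pointer + value
24
>>> sys.getsizeof({})      # an empty dict already reserves its 8 initial slots
280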
As far as I have understood, the Python dict is quite a fast implementation of a hash table. Is there anything better than that for my specific case? If you're using Python 3.5 or lower, the dictionary built into Python 3.6 is said to be 20-25% faster than the old dictionary built into Python 3.5.
You could use the IIBTree from Zope (available standalone as the BTrees package).
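A minimal sketch of how that might look, assuming the BTrees package is installed; note that IIBTree stores 32-bit C integers for both keys and values, so for 64-bit numbers you would want its LLBTree sibling:
>>> from BTrees.IIBTree import IIBTree
>>> d = IIBTree()
>>> d[42] = 7        # insert or replace, O(log n)
>>> d.get(42, -1)    # lookup with a negative sentinel for missing keys
7
>>> d.get(13, -1)
-1
Because the tree buckets hold keys and values unboxed in C arrays, the per-pair overhead should be well below the ~49 bytes measured above, though not all the way down to 8.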
I don't know if this is a one-shot solution or part of an ongoing project, but if it's the former, is throwing more RAM at it cheaper than the developer time needed to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at about 15 GB, which would fit easily enough into most desktop boxes.
I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with them to tell you exactly where to look.
http://docs.scipy.org/doc/numpy/reference/
You might also find some useful ideas in this thread: Memory Efficient Alternatives to Python Dictionaries
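One concrete NumPy approach (a sketch, not a drop-in dict replacement): keep the keys in one sorted array and the values in a parallel array, and binary-search on lookup with numpy.searchsorted. Lookups are O(log n), but inserting a new key into a sorted array is O(n), so this fits a build-once / query-many workload. The pairs below are made up for illustration.
import numpy as np

# Hypothetical pairs; in practice you would bulk-load your ~250M pairs once.
pairs = sorted([(17, 3), (42, 7), (99, 1)])
keys = np.array([k for k, _ in pairs], dtype=np.int32)  # 4 bytes per key
vals = np.array([v for _, v in pairs], dtype=np.int32)  # 4 bytes per value -> 8 bytes/pair
# Use dtype=np.int64 (16 bytes/pair) if keys or values can exceed 2**31 - 1.

def get(k):
    # Binary search: O(log n).
    i = np.searchsorted(keys, k)
    if i < len(keys) and keys[i] == k:
        return int(vals[i])
    return -1  # negative sentinel for "not present"

print get(42)   # 7
print get(13)   # -1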
8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous then either you'd waste a lot of space between the keys by using an array representation (as well as needing some sort of dead value to indicate a null key), or you'd need to maintain a separate index to key/value pairs which by definition would exceed your 8 bytes per pair (even if only by a small amount).
I suggest you go with your array method, but the best approach will depend, I expect, on the nature of the keys.
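If pulling in NumPy is undesirable, here is a stdlib-only sketch of the same sorted-parallel-arrays idea, using array.array and bisect. Replacing an existing value is O(log n); inserting a brand-new key still costs an O(n) shift, so it again suits data that is mostly bulk-loaded.
import array
import bisect

keys = array.array('l')   # sorted keys ('l' is a native C long)
vals = array.array('l')   # values, kept parallel to keys

def put(k, v):
    # Insert or replace; binary search plus an O(n) shift for new keys.
    i = bisect.bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        vals[i] = v
    else:
        keys.insert(i, k)
        vals.insert(i, v)

def get(k):
    # O(log n) lookup; returns -1 for missing keys.
    i = bisect.bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        return vals[i]
    return -1

put(42, 7)
put(17, 3)
print get(42)   # 7
print get(13)   # -1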