Reducing memory used by a large dict

I need to create an in-memory object with 9-digit integer keys and a boolean value associated with each key. I've been using a dict, as in the simplified example below:

#!/usr/bin/python
from __future__ import print_function
import sys

myDict = {}
for n in range(56000):
        myDict[n] = True

print('Count:',len(myDict),' Size:', sys.getsizeof(myDict))

I need to be able to look up and retrieve the boolean value associated with each key. The problem is the size of the dict. Using Python 2.7 on a 64-bit Linux system with the above example, the size of the dict is 3.1 megabytes according to sys.getsizeof() (about 56 bytes per entry to store 9 digits plus a boolean value).

I need to store the boolean state of approximately 55,000 entries in the dict. Each dict key is a 9-digit integer. I've tried using both the integer and str(theInteger) as keys, with no change in the size of the dict. For example, comparing the two key types side by side (exact sizes vary by Python version and platform):
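
#!/usr/bin/python
from __future__ import print_function
import sys

int_keyed = dict((n, True) for n in range(56000))
str_keyed = dict((str(n), True) for n in range(56000))

# getsizeof() only counts the dict's own hash table, not the key and value
# objects it references, so both containers report the same size.
print('int keys:', sys.getsizeof(int_keyed))
print('str keys:', sys.getsizeof(str_keyed))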

Is there some other sort of data structure or methodology I should be using to conserve memory with such a large data set?

asked Sep 04 '13 by RoyHB

People also ask

How much memory does a dict take Python?

This sums to at least 12 bytes on a 32-bit machine and 24 bytes on a 64-bit machine. The dictionary starts with 8 empty buckets and is resized by doubling the number of entries whenever its capacity is reached.
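
A rough sketch of how that resizing shows up (the exact thresholds and sizes differ between Python versions):

from __future__ import print_function
import sys

d = {}
last = sys.getsizeof(d)
print('empty dict:', last, 'bytes')
for n in range(1000):
    d[n] = True
    size = sys.getsizeof(d)
    if size != last:
        # the hash table just grew
        print('after', len(d), 'entries:', size, 'bytes')
        last = size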

Are Python dictionaries memory efficient?

The Python dictionary implementation consumes a surprisingly small amount of memory. But the space taken by the many int and (in particular) string objects, for reference counts, pre-calculated hash codes etc., is more than you'd think at first.

Why Python memory consumption is high?

Python does not free memory back to the system immediately after it destroys an object instance. It has some object pools, called arenas, and it takes a while until those are released. In some cases, you may be suffering from memory fragmentation, which also causes the process' memory usage to grow.

Is there a way to reduce the memory used by dicts?

On disk you have just strings; when loading into Python, the interpreter has to create a whole structure for each string and for each dictionary, besides the string itself. There's no way to reduce the memory used by the dicts themselves, but there are other ways to approach the problem. If you're willing to trade some speed for memory, you should consider storing and querying the strings from an SQLite file instead of loading everything into dictionaries in memory.
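
A minimal sketch of that SQLite approach, adapted to the integer-keyed booleans from the question (the file and table names here are made up):

import sqlite3

conn = sqlite3.connect('flags.db')
conn.execute('CREATE TABLE IF NOT EXISTS flags (key INTEGER PRIMARY KEY, value INTEGER)')
conn.executemany('INSERT OR REPLACE INTO flags VALUES (?, ?)',
                 ((n, 1) for n in range(56000)))
conn.commit()

# Look up one key; row is None when the key is not stored at all.
row = conn.execute('SELECT value FROM flags WHERE key = ?', (123,)).fetchone()
value = bool(row[0]) if row else None
print(value)   # True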

How to reduce memory usage of a Python program?

So you can reduce memory usage even further, to about 8 MB, by using a Pandas DataFrame to store the information: it will use NumPy arrays to efficiently store the numbers internally. In general, storing too many Python objects at once will waste memory. As always, solutions can involve compression, batching, or indexing.
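
A small sketch of that idea using a pandas Series keyed by the 9-digit integers (the key/value data here is made up; pandas must be installed):

import pandas as pd

flags = pd.Series([True, False, True],
                  index=[100000000, 100000007, 999999999],
                  dtype=bool)

print(flags.get(100000007))   # False
print(flags.get(123456789))   # None: key not present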

Do Python dictionaries use a lot of memory?

Not bad; given how often dictionaries are used in Python, it’s good to know that they don’t normally consume that much memory. What if I add something to the dict? What will happen to the memory usage? Something seems a bit fishy here, right?


2 Answers

If you look up your boolean with an integer key, and the range of keys starts with 0 and is continuous, there's really no reason not to use a list:

my_list = []
for n in range(56000):
    my_list.append(True)

or better:

my_list = [True for n in range(56000)]

If that is not enough, try the array module and use one byte per bool:

import array
# typecode 'b' is a signed char, so each entry takes one byte
my_array = array.array("b", (True for n in range(56000)))

And if that is not good enough, try the bitarray module from PyPI.
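
A short sketch with bitarray (a third-party package, pip install bitarray), which stores one bit per entry instead of one byte:

from bitarray import bitarray

my_bits = bitarray(56000)   # roughly 7 KB of payload for 56000 flags
my_bits.setall(True)
my_bits[42] = False
print(my_bits[42])          # False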

Another idea is to use a set. Say you have many more False values than True values; then simply keep a set of the True keys:

my_true_numbers = {0, 2323, 23452} # just the True ones

and check with

value = number in my_true_numbers

If you have more True than False, do it the other way round.

answered Nov 22 '22 by Christian Schramm


The accepted answer of Python: Reducing memory usage of dictionary concludes that there is not much you can do, and I agree with it. The overall memory overhead of a dictionary is small, but the number of key-value pairs in your example raises the memory footprint.

One thing might be possible: if the keys are always linear, you could create a list of booleans directly, or better, use bitarray. The keys would then be implicit, as in the sketch below. But if this is only the case in your example, you can't do much.
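
A minimal sketch of that implicit-key idea, assuming (hypothetically) that the 9-digit keys form one contiguous run starting at some base value:

BASE = 100000000              # hypothetical smallest key
flags = [True] * 56000        # index 0 corresponds to key BASE

def set_flag(key, value):
    flags[key - BASE] = value

def get_flag(key):
    return flags[key - BASE]

set_flag(100000042, False)
print(get_flag(100000042))    # False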

answered Nov 22 '22 by tea2code