There are plenty of questions and discussion about memory consumption of different python data types. Yet few of them (if any) come to a very specific scenario. When you want to store LOTS of key-value data in memory, which data structure is more memory-efficient, a dict or a list of tuples?
At beginning I thought dict is more powerful than list of tuples and that power must come with some price, and actually an empty dict DOES occupy more memory than an empty list or tuple (see In-memory size of a Python structure), so I thought using [(key1, value1), (key2, value2), ...]
would be more memory-efficient than {key1: value1, key2: value2, ...}
.
Looks like I was wrong. Just fire up the following code snippet, and see the mem consumption reported by your OS. I am using Windows XP so that task manager tells me, a large dict eats up "only" 40MB Ram and 40MB VIRTURAL Ram, but a list of tuples eats up 60MB Ram and 60MB Virtual ram.
How could that be?
from sys import getsizeof as g raw_input('ready, press ENTER') i = 1000000 #p = [(x, x) for x in xrange(i)] # Will print 4,348,736 40,348,736 p = dict((x, x) for x in xrange(i)) # Will print 25,165,964 37,165,964 print g(p), g(p) + sum(g(x) for x in p) raw_input("Check your process's memory consumption now, press ENTER to exit")
Update:
Thanks for some of the comments below. I wanna clarify: I'm talking about memory-efficiency. And no, in this case no need to worry about key-value lookup efficiency, let's just assume my algorithm will consume them one by one via iterator.
List has a large memory. Tuple is stored in a single block of memory. Creating a tuple is faster than creating a list. Creating a list is slower because two memory blocks need to be accessed.
Although dictionaries are optimized a lot more in Python 3.6, they still use more memory than lists, since you need to use space for the keys and the lookup as well, while lists use space only for the values.
Tuples are stored in a single block of memory. Tuples are immutable so, It doesn't require extra space to store new objects. Lists are allocated in two blocks: the fixed one with all the Python object information and a variable sized block for the data. It is the reason creating a tuple is faster than List.
Dictionary occupies much more space than a list of tuples. Even an empty dict occupies much space as compared to a list of tuples.
Your list
of tuple
s adds an extra layer. You have 3 layers of items:
while your dict
only holds:
It's those 1 million tuples plus the list to hold the references to them that take up more memory than the 1 million cached hashes. There are some 50% more pointers involved here, easily accounting for the 50% more memory use you see.
There is another downside to your list of tuples: lookup time. To find a matching key in the dict, there is a O(1) complexity cost. To do the same in the list of tuples, you have to potentially scan the whole list for a O(n) cost. Don't use a list of tuples if you need to map keys to values.
You're actually getting an incomplete picture of memory use in this case. The total size of a dictionary more than doubles at irregular intervals, and if you compare the size of these two structures right after the dictionary size has increased, it's bigger again. A simple script with a recursive size function (see code below) shows a pretty clear pattern:
i: 2 list size: 296 dict size: 328 difference: -32 i: 3 list size: 392 dict size: 352 difference: 40 i: 4 list size: 488 dict size: 376 difference: 112 i: 5 list size: 616 dict size: 400 difference: 216 i: 7 list size: 808 dict size: 1216 difference: -408 i: 10 list size: 1160 dict size: 1288 difference: -128 i: 13 list size: 1448 dict size: 1360 difference: 88 i: 17 list size: 1904 dict size: 1456 difference: 448 i: 23 list size: 2480 dict size: 3904 difference: -1424 i: 31 list size: 3328 dict size: 4096 difference: -768 i: 42 list size: 4472 dict size: 4360 difference: 112 i: 56 list size: 5912 dict size: 4696 difference: 1216 i: 74 list size: 7880 dict size: 5128 difference: 2752 i: 100 list size: 10520 dict size: 14968 difference: -4448 i: 133 list size: 14024 dict size: 15760 difference: -1736 i: 177 list size: 18672 dict size: 16816 difference: 1856
This pattern continues as i
grows. (You can test this using your method -- try setting i
near 2636744
. The size of the dictionary is larger at that point, at least for me.) Martijn is right that the tuples in the list of tuples add to the memory overhead, canceling out the memory advantage of lists over dictionaries. But the result, on average, is not that the dictionary is better; it's that the dictionary is about the same. So in answer to your original question:
When you want to store LOTS of key-value data in memory, which data structure is more memory-efficient, a dict or a list of tuples?
It doesn't really matter if all you're concerned about is memory.
However, note that iterating over a dictionary is often a bit slower than iterating over a list, because there's no good way to avoid iterating over all the empty bins in the dictionary. So there's a bit of a tradeoff -- dictionaries are (much) faster at doing random key lookups, but lists are (a bit) faster at iteration. The dictionary will probably be better most of the time, but in some rare cases, the list may provide a micro-optimization.
Here's the code that tests for size. It probably won't generate correct results for all corner cases, but it should handle simple structures like this without any problems. (But let me know if you see any issues.)
import sys, collections, itertools, math def totalsize(x): seen = set() return ts_rec(x, seen) def ts_rec(x, seen): if id(x) in seen: return 0 else: seen.add(id(x)) x_size = sys.getsizeof(x) if isinstance(x, collections.Mapping): kv_chain = itertools.chain.from_iterable(x.iteritems()) return x_size + sum(ts_rec(i, seen) for i in kv_chain) elif isinstance(x, collections.Sequence): return x_size + sum(ts_rec(i, seen) for i in x) else: return x_size for i in (10 ** (e / 8.0) for e in range(3, 19)): i = int(i) lsize = totalsize([(x, x) for x in xrange(i)]) dsize = totalsize(dict((x, x) for x in xrange(i))) print "i: ", i, print " list size: ", lsize, " dict size: ", dsize, print " difference: ", lsize - dsize
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With