I am using defaultdict
to store millions of phrases, so my data structure looks like mydict['string'] = set(['other', 'strings'])
. It seems to work ok for smaller sets but when I hit anything over 10 million keys, my program just crashes with the helpful message of Process killed
. I know defaultdict
s are memory heavy, but is there an optimised method of storing using defaultdict
s or would I have to look at other data structures like numpy array?
Thank you
Finally, using a defaultdict to handle missing keys can be faster than using dict.
It depends on the data; setdefault is faster and simpler with small data sets; defaultdict is faster for larger data sets with more homogenous key sets (ie, how short the dict is after adding elements);
defaultdict fills a specialized use-case, where you want values to be auto-created for you when a key is missing, and those values can be generated with a factory function without access to key being inserted. Note that exceptions are not just there to flag programmer error.
A defaultdict works exactly like a normal dict, but it is initialized with a function (“default factory”) that takes no arguments and provides the default value for a nonexistent key. A defaultdict will never raise a KeyError. Any key that does not exist gets the value returned by the default factory.
If you're set on staying in memory with a single Python process, then you're going to have to abandon the dict
datatype -- as you noted, it has excellent runtime performance characteristics, but it uses a bunch of memory to get you there.
Really, I think @msw's comment and @Udi's answer are spot on -- to scale you ought to look at on-disk or at least out-of-process storage of some sort, probably an RDBMS is the easiest thing to get going.
However, if you're sure that you need to stay in memory and in-process, I'd recommend using a sorted list to store your dataset. You can do lookups in O(log n) time and insertions and deletions in constant time, and you can wrap up the code for yourself so that the usage looks pretty much like a defaultdict
. Something like this might help (not debugged beyond tests at the bottom):
import bisect
class mystore:
def __init__(self, constructor):
self.store = []
self.constructor = constructor
self.empty = constructor()
def __getitem__(self, key):
i, k = self.lookup(key)
if k == key:
return v
# key not present, create a new item for this key.
value = self.constructor()
self.store.insert(i, (key, value))
return value
def __setitem__(self, key, value):
i, k = self.lookup(key)
if k == key:
self.store[i] = (key, value)
else:
self.store.insert(i, (key, value))
def lookup(self, key):
i = bisect.bisect(self.store, (key, self.empty))
if 0 <= i < len(self.store):
return i, self.store[i][0]
return i, None
if __name__ == '__main__':
s = mystore(set)
s['a'] = set(['1'])
print(s.store)
s['b']
print(s.store)
s['a'] = set(['2'])
print(s.store)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With