Python defaultdict for large data sets

I am using defaultdict to store millions of phrases, so my data structure looks like mydict['string'] = set(['other', 'strings']). It seems to work fine for smaller sets, but when I hit anything over 10 million keys, my program just crashes with the helpful message Process killed. I know defaultdicts are memory heavy, but is there an optimised way of storing this with defaultdicts, or would I have to look at other data structures like a numpy array?
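Roughly, the code looks like this (simplified):

from collections import defaultdict

# Each phrase maps to a set of associated strings.
mydict = defaultdict(set)

mydict['string'].add('other')
mydict['string'].add('strings')
# ... millions of phrases like this, until the process is killed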

Thank you

Lezan asked Aug 03 '14

People also ask

Is Defaultdict faster than dict?

Using a defaultdict to handle missing keys can be faster than doing the equivalent bookkeeping with a plain dict.

Is Defaultdict slower than dict?

It depends on the data: setdefault is faster and simpler with small data sets, while defaultdict is faster for larger data sets with more homogeneous key sets (i.e., where many insertions repeat the same keys, so the dict stays short after adding elements).
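If you want to measure this for your own workload, here is a rough sketch using the standard timeit module (the sample data is made up):

import timeit
from collections import defaultdict

words = ['apple', 'banana', 'apple', 'cherry'] * 25000  # many repeated keys

def with_setdefault():
    d = {}
    for i, w in enumerate(words):
        d.setdefault(w, set()).add(i)
    return d

def with_defaultdict():
    d = defaultdict(set)
    for i, w in enumerate(words):
        d[w].add(i)
    return d

# defaultdict avoids constructing a throwaway set() on every call,
# which is where setdefault loses ground when keys repeat often.
print(timeit.timeit(with_setdefault, number=10))
print(timeit.timeit(with_defaultdict, number=10))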

Why should I use Defaultdict?

defaultdict fills a specialized use case: you want values to be auto-created for you when a key is missing, and those values can be generated by a factory function that takes no arguments (so it has no access to the key being inserted). Note that exceptions are not just there to flag programmer error.

What does Defaultdict set do in Python?

A defaultdict works exactly like a normal dict, but it is initialized with a function (“default factory”) that takes no arguments and provides the default value for a nonexistent key. A defaultdict will never raise a KeyError. Any key that does not exist gets the value returned by the default factory.
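For example (variable names are illustrative):

from collections import defaultdict

d = defaultdict(set)   # set() is the default factory
d['seen'].add('x')     # key auto-created as an empty set, then 'x' added
print(d['missing'])    # prints set() -- no KeyError; the key now exists
print(dict(d))         # {'seen': {'x'}, 'missing': set()}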


1 Answer

If you're set on staying in memory with a single Python process, then you're going to have to abandon the dict datatype -- as you noted, it has excellent runtime performance characteristics, but it uses a bunch of memory to get you there.

Really, I think @msw's comment and @Udi's answer are spot on -- to scale, you ought to look at on-disk or at least out-of-process storage of some sort; an RDBMS is probably the easiest thing to get going with.
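If you go that route, a minimal sketch using the standard library's sqlite3 module might look like this (the schema and file name are illustrative, not from the original answer):

import sqlite3

conn = sqlite3.connect('phrases.db')  # on-disk storage, not RAM
conn.execute('CREATE TABLE IF NOT EXISTS pairs (phrase TEXT, other TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_phrase ON pairs (phrase)')

# mydict['string'].add('other') becomes an INSERT:
conn.execute('INSERT INTO pairs VALUES (?, ?)', ('string', 'other'))
conn.commit()

# mydict['string'] becomes a SELECT:
rows = conn.execute('SELECT other FROM pairs WHERE phrase = ?', ('string',))
print({r[0] for r in rows})  # {'other'}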

However, if you're sure that you need to stay in memory and in-process, I'd recommend using a sorted list to store your dataset. You can do lookups in O(log n) time; insertions and deletions cost O(n) because the underlying list has to shift elements, but the memory footprint is far smaller than a dict's. You can wrap up the code for yourself so that the usage looks pretty much like a defaultdict. Something like this might help (not debugged beyond the quick checks at the bottom):

import bisect

class mystore:
    # A sorted list of (key, value) pairs that mimics a defaultdict.

    def __init__(self, constructor):
        self.store = []                # kept sorted by key
        self.constructor = constructor

    def __getitem__(self, key):
        i, k = self.lookup(key)
        if k == key:
            return self.store[i][1]
        # Key not present: create a default value, mimicking defaultdict.
        value = self.constructor()
        self.store.insert(i, (key, value))
        return value

    def __setitem__(self, key, value):
        i, k = self.lookup(key)
        if k == key:
            self.store[i] = (key, value)
        else:
            self.store.insert(i, (key, value))

    def lookup(self, key):
        # Search on (key,) alone: a 1-tuple sorts before any (key, value)
        # pair with the same key, so we never compare the values (sets are
        # not totally ordered) and we always land on the first match.
        i = bisect.bisect_left(self.store, (key,))
        if i < len(self.store):
            return i, self.store[i][0]
        return i, None

if __name__ == '__main__':
    s = mystore(set)
    s['a'] = set(['1'])
    print(s.store)
    s['b']
    print(s.store)
    s['a'] = set(['2'])
    print(s.store)
lmjohns3 answered Oct 27 '22