
Memory Efficient Alternatives to Python Dictionaries

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
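For reference, a minimal sketch of that nested-dictionary approach (names like topDict and count_triplets are illustrative, not taken from the actual project):

from collections import defaultdict

def count_triplets(words):
    # topDict[word1][word2][word3] -> number of times the triplet appears
    topDict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        topDict[w1][w2][w3] += 1
    return topDict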

This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.

My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.

From what I know of data structures, a balanced binary search tree such as a red-black or AVL tree would probably be ideal, but I would really prefer not to implement one myself. If possible, I'd prefer to stick with standard Python libraries, but I'm definitely open to other alternatives if they would work better.

So, does anyone have any suggestions for me?

Edited to add:

Thanks for the responses so far. A few of the answers have suggested using tuples; condensing the first two words into a tuple didn't really do much for me, though. I am hesitant to use all three as a key, since I want it to be easy to look up all third words given the first two (i.e. I want something like the result of topDict[word1, word2].keys()).
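Roughly, the kind of structure I mean (illustrative names, not my actual code):

from collections import defaultdict

# pairDict[(word1, word2)][word3] -> count; the inner dict's keys are
# exactly the third words that can follow the pair
pairDict = defaultdict(lambda: defaultdict(int))

def add_triplet(w1, w2, w3):
    pairDict[w1, w2][w3] += 1

# looking up all third words for a pair:
# pairDict[word1, word2].keys()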

The current dataset I am playing around with is the most recent version of Wikipedia For Schools. Parsing the first thousand pages, for example, produces something like an 11MB text file where each line is the three words and the count, all tab-separated. Storing the same data in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.

asked Nov 29 '08 by ricree




1 Answer

Some measurements. I took 10MB of free e-book text and computed trigram frequencies, producing a 24MB file. Storing it in different simple Python data structures took this much space in kB, measured as RSS from running ps, where d is a dict, keys and freqs are lists, and a,b,c,freq are the fields of a trigram record:

295760     S. Lott's answer
237984     S. Lott's with keys interned before passing in
203172 [*] d[(a,b,c)] = int(freq)
203156     d[a][b][c] = int(freq)
189132     keys.append((a,b,c)); freqs.append(int(freq))
146132     d[intern(a),intern(b)][intern(c)] = int(freq)
145408     d[intern(a)][intern(b)][intern(c)] = int(freq)
 83888 [*] d[a+' '+b+' '+c] = int(freq)
 82776 [*] d[(intern(a),intern(b),intern(c))] = int(freq)
 68756     keys.append((intern(a),intern(b),intern(c))); freqs.append(int(freq))
 60320     keys.append(a+' '+b+' '+c); freqs.append(int(freq))
 50556     pair array
 48320     squeezed pair array
 33024     squeezed single array

The entries marked [*] have no efficient way to look up a pair (a,b); they're listed only because others have suggested them (or variants of them). (I was sort of irked into making this because the top-voted answers were not helpful, as the table shows.)
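For the intern() rows, the point is that interning makes repeated words share a single string object, so keys built from them reference the same few word strings instead of fresh copies. A rough sketch of the d[intern(a),intern(b)][intern(c)] = int(freq) variant (written for Python 3, where intern lives in sys; in Python 2 it was a builtin):

import sys
from collections import defaultdict

d = defaultdict(dict)
for line in open('freqs'):   # same 'word1 word2 word3 freq' format as below
    a, b, c, freq = line.split()
    # interned words are shared, so every key tuple points at the
    # same small set of word string objects
    d[sys.intern(a), sys.intern(b)][sys.intern(c)] = int(freq)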

'Pair array' is the scheme below in my original answer ("I'd start with the array with keys being the first two words..."), where the value table for each pair is represented as a single string. 'Squeezed pair array' is the same, leaving out the frequency values that are equal to 1 (the most common case). 'Squeezed single array' is like squeezed pair array, but gloms key and value together as one string (with a separator character). The squeezed single array code:

import collections

def build(file):
    pairs = collections.defaultdict(list)
    for line in file:  # N.B. file assumed to be already sorted
        a, b, c, freq = line.split()
        key = ' '.join((a, b))
        pairs[key].append(c + ':' + freq if freq != '1' else c)
    out = open('squeezedsinglearrayfile', 'w')
    for key in sorted(pairs.keys()):
        out.write('%s|%s\n' % (key, ' '.join(pairs[key])))

def load():
    return open('squeezedsinglearrayfile').readlines()

if __name__ == '__main__':
    build(open('freqs'))

I haven't written the code to look up values from this structure (use bisect, as mentioned below), or implemented the fancier compressed structures also described below.
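For completeness, here is roughly what such a lookup could look like (a hypothetical lookup() helper, assuming the sorted lines returned by load() above):

import bisect

def lookup(lines, word1, word2):
    # lines: sorted list returned by load(); each line is
    # 'word1 word2|word3[:freq] word3[:freq] ...' with freq omitted when 1
    prefix = word1 + ' ' + word2 + '|'
    i = bisect.bisect_left(lines, prefix)
    if i == len(lines) or not lines[i].startswith(prefix):
        return {}                      # pair not present
    counts = {}
    for entry in lines[i].rstrip('\n')[len(prefix):].split():
        word3, _, freq = entry.partition(':')
        counts[word3] = int(freq) if freq else 1
    return counts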

Original answer: A simple sorted array of strings, each string being a space-separated concatenation of words, searched using the bisect module, should be worth trying for a start. This saves space on pointers, etc. It still wastes space due to the repetition of words; there's a standard trick to strip out common prefixes, with another level of index to get them back, but that's rather more complex and slower. (The idea is to store successive chunks of the array in a compressed form that must be scanned sequentially, along with a random-access index to each chunk. Chunks are big enough to compress, but small enough for reasonable access time. The particular compression scheme applicable here: if successive entries are 'hello george' and 'hello world', make the second entry '6world' instead, 6 being the length of the common prefix. Or maybe you could get away with using zlib? Anyway, you can find out more in this vein by looking up dictionary structures used in full-text search.)

So specifically, I'd start with the array whose keys are the first two words, with a parallel array whose entries list the possible third words and their frequencies. It might still suck, though -- I think you may be out of luck as far as batteries-included memory-efficient options go.
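To make the prefix trick concrete, a tiny front-coding sketch (illustrative only; the digit-plus-suffix encoding follows the 'hello george' / '6world' example and would need a less ambiguous format in practice):

def front_code(sorted_entries):
    # store each entry after the first as <shared prefix length><suffix>
    out, prev = [], ''
    for s in sorted_entries:
        n = 0
        while n < len(prev) and n < len(s) and prev[n] == s[n]:
            n += 1
        out.append(s if not out else '%d%s' % (n, s[n:]))
        prev = s
    return out

# front_code(['hello george', 'hello world']) == ['hello george', '6world']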

Also, binary tree structures are not recommended for memory efficiency here. E.g., this paper tests a variety of data structures on a similar problem (unigrams instead of trigrams though) and finds a hashtable to beat all of the tree structures by that measure.

I should have mentioned, as someone else did, that the sorted array could be used just for the wordlist, not bigrams or trigrams; then for your 'real' data structure, whatever it is, you use integer keys instead of strings -- indices into the wordlist. (But this keeps you from exploiting common prefixes except in the wordlist itself. Maybe I shouldn't suggest this after all.)
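A sketch of that wordlist indirection (hypothetical helpers, not from the answer itself):

import bisect

def build_wordlist(words):
    # sorted list of distinct words; a word's position is its integer id
    return sorted(set(words))

def word_id(wordlist, word):
    i = bisect.bisect_left(wordlist, word)
    assert i < len(wordlist) and wordlist[i] == word, 'unknown word'
    return i

# trigram keys then become tuples of small integers,
# e.g. (word_id(wl, a), word_id(wl, b), word_id(wl, c)), instead of strings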

answered Sep 22 '22 by 13 revs, 2 users