Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to sort a LARGE dictionary

Tags:

python

I have a python script that is working with a large (~14gb) textfile. I end up with a dictionary of keys and values, but I am getting a memory error when I try to sort the dictionary by value.

I know the dictionary is too big to load into memory and then sort, but how could I go about accomplishing this?

like image 599
Joff Avatar asked Oct 30 '22 06:10

Joff


1 Answers

You can use an ordered key/value store like wiredtiger, leveldb, bsddb. All of them support ordered keys using custom sort function. leveldb is the easiest to use but if you use python 2.7, bsddb is included in the stdlib. If you only require lexicographic sorting you can use the raw hashopen function to open a persistent sorted dictionary:

from bsddb import hashopen


db = hashopen('dict.db')
db['020'] = 'twenty'
db['002'] = 'two'
db['value'] = 'value'
db['key'] = 'key'

print(db.keys())

This outputs

>>> ['002', '020', 'key', 'value']

Don't forget to close the db after your work:

db.close()

Mind the fact that hashopen configuration might not suit your need. In this case I recommend you use leveldb which has a simple API or wiredtiger for speed.

To order by value in bsddb, you have to use the composite key pattern or key composition. Which boils down to create a dictionary key which keeps the ordering you look for. In this example we pack the original dict value first (so that small values appears first) with the original dict key (so that the bsddb key is unique):

import struct
from bsddb import hashopen

my_dict = {'a': 500, 'abc': 100, 'foobar': 1}

# insert
db = hashopen('dict.db')
for key, value in my_dict.iteritems():
    composite_key = struct.pack('>Q', value) + key
    db[composite_key] = ''  # value is not useful in this case but required
db.close()


# read
db = hashopen('dict.db')
for key, _ in db.iteritems():  # iterate over database
    size = struct.calcsize('>Q')
    # unpack
    value, key = key[:size], key[size:]
    value = struct.unpack('>Q', value)[0]
    print key, value
db.close()

This outputs the following:

foobar 1
abc 100
a 500
like image 173
amirouche Avatar answered Nov 15 '22 03:11

amirouche