Loading a huge Python Pickle dictionary

Tags:

python

pickle

I generated a file of about 5 GB with pickle.dump(). Loading it takes about half a day and around 50 GB of RAM. My question is whether it is possible to read this file entry by entry (one at a time) rather than loading it all into memory, or if you have any other suggestion for how to access the data in such a file.

Many thanks.

asked Jan 05 '12 by user1132834

People also ask

How do I store a large dictionary in Python?

If you just want to work with a dictionary larger than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on pickle (cPickle in Python 2), so be sure to set your protocol to something other than 0.
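As a minimal sketch of that idea (the filename "my_data" is arbitrary):

```python
import shelve

# shelve gives a persistent, dict-like object backed by a file on disk;
# entries are pickled individually, so only the ones you touch are loaded.
with shelve.open("my_data", protocol=5) as db:
    db["A"] = 1
    db["B"] = [2, 3, 4]

# Reopen later: only the requested entry is read and unpickled.
with shelve.open("my_data") as db:
    value = db["A"]

print(value)  # 1
```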

How do I load pickled data in Python?

To retrieve pickled data, the steps are quite simple: use the pickle.load() function. Its primary argument is the file object you get by opening the file in read-binary ('rb') mode.
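A short round-trip example of dump and load:

```python
import pickle

data = {"A": 1, "B": 2, "C": 3}

# Serialize to disk: the file must be opened in write-binary mode.
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

# Deserialize: pickle.load takes the file object, opened read-binary.
with open("data.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == data)  # True
```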

How big can a dictionary be in Python?

There is in principle no size limitation to a dictionary in Python, except the capacity of your available memory (RAM + Swap space).

Are pickles faster than JSON?

Speed: pickle is slower, JSON is faster, because of the serialization method. Security: pickle is not secure, JSON is. Only deserialize pickled data that you trust: a pickle stream is effectively a small program, and unpickling it can trigger function calls that may be malicious.
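The security point is worth demonstrating: a pickle stream can instruct the unpickler to call arbitrary functions. A minimal, harmless illustration:

```python
import pickle

class Surprise:
    # __reduce__ tells pickle how to rebuild the object; here it asks
    # the unpickler to call print() with an argument of our choosing.
    # A hostile payload could name any callable instead.
    def __reduce__(self):
        return (print, ("this ran during unpickling!",))

blob = pickle.dumps(Surprise())
pickle.loads(blob)  # prints the message: code ran during deserialization
```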


1 Answer

There is absolutely no question that this should be done using a database rather than pickle; databases are designed for exactly this kind of problem.

Here is some code to get you started, which puts a dictionary into an SQLite database and shows an example of retrieving a value. To get this to work with your actual dictionary rather than my toy example, you'll need to learn more about SQL, but fortunately there are many excellent resources available online. In particular, you might want to learn how to use SQLAlchemy, which is an "Object Relational Mapper" that can make working with databases as intuitive as working with objects.

import sqlite3

# an enormous dictionary too big to be stored in pickle
my_huge_dictionary = {"A": 1, "B": 2, "C": 3, "D": 4}

# create a database in the file my.db
conn = sqlite3.connect('my.db')
c = conn.cursor()

# Create table with two columns: k and v (for key and value). Here your key
# is assumed to be a string of length 10 or less, and your value is assumed
# to be an integer. I'm sure this is NOT the structure of your dictionary;
# you'll have to read into SQL data types.
c.execute("""
create table dictionary (
    k char(10) NOT NULL,
    v integer NOT NULL,
    PRIMARY KEY (k))
""")

# Dump your enormous dictionary into the database. This will take a while
# for your large dictionary, but you should do it only once, and then in the
# future make changes to the database rather than to a pickled file. The ?
# placeholders are parameterized queries, which avoid SQL injection and
# quoting bugs that string interpolation would invite.
for k, v in my_huge_dictionary.items():
    c.execute("insert into dictionary values (?, ?)", (k, v))
conn.commit()

# retrieve a value from the database
my_key = "A"
c.execute("select v from dictionary where k = ?", (my_key,))
my_value = c.fetchone()[0]
print(my_value)
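Once the data is in SQLite, you can also walk it entry by entry, which was the original goal. A self-contained sketch using an in-memory database (the table layout mirrors the one above):

```python
import sqlite3

# Build a small sample table; with a real dataset you would connect
# to the my.db file created earlier instead of ":memory:".
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table dictionary (k text primary key, v integer not null)")
c.executemany("insert into dictionary values (?, ?)",
              [("A", 1), ("B", 2), ("C", 3)])
conn.commit()

# Iterating the cursor fetches rows incrementally, so memory use stays
# flat no matter how many rows the table holds.
total = 0
for k, v in c.execute("select k, v from dictionary"):
    total += v

print(total)  # 6
```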

Good luck!

answered Oct 15 '22 by David Robinson