
Is there a faster way to store a big dictionary, than pickle or regular Python file? [closed]

Tags:

python

pickle

I want to store a dictionary which only contains data in the following format:

{
    "key1" : True,
    "key2" : True,
    .....
}

In other words, just a quick way to check if a key is valid or not. I can do this by storing a dict called foo in a file called bar.py, and then in my other modules, I can import it as follows:

from bar import foo

Or, I can save it in a pickle file called bar.pickle, and import it at the top of the file as follows:

import pickle  
with open('bar.pickle', 'rb') as f:
    foo = pickle.load(f)

Which would be the ideal, and faster way to do this?

asked Jan 02 '19 by darkhorse


People also ask

How do I store a large dictionary in Python?

If you want to work with a dictionary larger than memory can hold, the shelve module is a good quick-and-dirty solution. It behaves like a dict but keeps its contents on disk rather than in memory. shelve serialises its entries with pickle, so be sure to set the protocol to something higher than 0.
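
A minimal sketch of that usage (the filename 'flags.db' and the keys here are placeholders, not taken from the question):

import shelve

# writing: shelve.open returns a dict-like object backed by a file on disk
with shelve.open('flags.db', protocol=2) as db:
    db['key1'] = True
    db['key2'] = True

# reading: only the entries you actually access are loaded, so the data can exceed RAM
with shelve.open('flags.db', flag='r') as db:
    print(db.get('key1', False))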

Why pickle is not good in Python?

Pickle constructs arbitrary Python objects by invoking arbitrary functions, which is why it is not secure. That same property, however, lets it serialise almost any Python object, including objects that JSON and other serialisation formats cannot represent, and unpickling usually requires no boilerplate.
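
A small illustration (not from the question) of that difference in what can be serialised:

import json
import pickle

data = {'ids': {1, 2, 3}}          # a set is not a JSON type

blob = pickle.dumps(data)          # pickle handles it fine
print(pickle.loads(blob) == data)  # True

try:
    json.dumps(data)               # TypeError: set is not JSON serializable
except TypeError as exc:
    print(exc)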

Which is faster JSON or pickle?

JSON is a lightweight text format and is fast to parse, although for simple data pickle can be just as quick (see the benchmarks in the answers below). There is always a security risk with pickle: unpickling data from an unknown source should be avoided, because it may execute malicious code. Loading JSON never executes code, so it does not carry that class of risk.

Is Python pickling slow?

Pickle, on the other hand, is comparatively slow, insecure, and can only be parsed by Python. Its only real advantage is that it can serialise arbitrary Python objects, whereas both JSON and MessagePack limit the types of data they can write out.

Why is a dictionary faster than a list in Python?

Key lookups in a dictionary are faster than searching a list because Python implements dictionaries as hash tables: a lookup hashes the key and jumps almost directly to the matching entry instead of scanning every element. Dictionaries are Python's built-in mapping type and have been highly optimized.
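
A rough way to see this for yourself (a benchmark sketch; the sizes and numbers are arbitrary and will vary by machine):

import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list, True)

# membership test for a worst-case element (the last one)
print(timeit.timeit(lambda: (n - 1) in as_list, number=1_000))  # linear scan
print(timeit.timeit(lambda: (n - 1) in as_dict, number=1_000))  # hash lookup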

Why do Python dictionaries take up so much memory?

Because dictionaries are Python's built-in mapping type, they are highly optimized. However, there is a classic space-time trade-off between dictionaries and lists: a dictionary reduces lookup time, but its hash table needs more memory.

How much faster is a dictionary lookup than a list lookup?

With 10,000,000 items, a dictionary lookup can be around 585,714 times faster than a list lookup. Numbers such as 6.6 or 585,714 are simply the results of a quick test on one machine and will vary.


2 Answers

To add to @scnerd's comment, here are the timings in IPython for different load situations.

Here we create a dictionary and write it to 3 formats:

import random
import json
import pickle

letters = 'abcdefghijklmnopqrstuvwxyz'
d = {''.join(random.choices(letters, k=6)): random.choice([True, False]) 
     for _ in range(100000)}

# write a python file
with open('mydict.py', 'w') as fp:
    fp.write('d = {\n')
    for k,v in d.items():
        fp.write(f"'{k}':{v},\n")
    fp.write('None:False}')  # dummy final entry absorbs the trailing comma and closes the dict

# write a pickle file
with open('mydict.pickle', 'wb') as fp:
    pickle.dump(d, fp)

# write a json file (text mode: json.dump writes str, not bytes)
with open('mydict.json', 'w') as fp:
    json.dump(d, fp)

Python file:

# the first import has to parse mydict.py (and write __pycache__)
%%timeit -n1 -r1
from mydict import d

644 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# within the same session, a re-import just hits sys.modules, so it is MUCH faster
%%timeit
from mydict import d

1.37 µs ± 54.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

pickle file:

%%timeit
with open('mydict.pickle', 'rb') as fp:
    pickle.load(fp)

52.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

json file:

%%timeit
with open('mydict.json', 'rb') as fp:
    json.load(fp)

81.3 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# here is the same test with ujson
import ujson

%%timeit
with open('mydict.json', 'rb') as fp:
    ujson.load(fp)

51.2 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Nov 23 '22 by James


Python File

Using a Python file means the dictionary gets cached automatically: if you "import" it multiple times, it only has to be parsed once. However, Python syntax is complicated, so the parser that loads the file is not specialized for the limited complexity of the data you're saving (unless you're also including arbitrary Python objects and code). It's easy to view, edit, and use, but not easy to move between programs or languages.
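
For example (assuming the bar.py/foo names from the question), the caching looks roughly like this:

import importlib
import sys

from bar import foo           # first import: bar.py is parsed and byte-compiled

print('bar' in sys.modules)   # True: the parsed module is cached for this session
from bar import foo           # re-import: no parsing, just a sys.modules lookup

import bar
importlib.reload(bar)         # only an explicit reload re-parses bar.py
foo = bar.foo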

EDIT: to clarify, raw Python files are easy for a human to modify, but very hard for a program to modify. If your code changes the data at runtime and you want those changes written back to the file, you're pretty much out of luck: use one of the formats below instead.

Pickle File

If you use a pickle file, you either re-load the file every time you use it or write some management code to cache it after the first read. Like arbitrary Python code, pickle files can be quite complex, and the loader is not specialized for your particular data types since, like raw Python files, they can store almost arbitrary Python objects. They are hard for a human to view or edit, you may run into portability issues if you move the data between machines or Python versions, and only Python can read them. You also need to keep pickle's security implications in mind: loading a pickle file can execute arbitrary code, so only unpickle files you trust.
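
One way to get that caching (a sketch: the load_flags helper is made up here, and bar.pickle is the filename from the question) is a loader wrapped in functools.lru_cache:

import pickle
from functools import lru_cache

@lru_cache(maxsize=1)
def load_flags(path='bar.pickle'):
    # the disk read happens only on the first call;
    # later calls return the same cached dictionary object
    with open(path, 'rb') as fp:
        return pickle.load(fp)

foo = load_flags()
print(foo.get('key1', False))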

JSON File

If all you're storing is simple objects (dictionaries, lists, strings, booleans, numbers), consider using the JSON file format. Python has a built-in json module that's just as easy to use as pickle, so there's no added complexity. JSON files are easy to store, view, edit, and compress (if desired), and they look almost exactly like a Python dictionary. JSON is also highly portable (most common languages can read and write it these days), and if you need faster loading, the ujson module is a faster, drop-in replacement for the standard json module. Since the JSON format is fairly restricted, I'd expect its parsers and writers to be quite a bit faster than the regular Python or pickle parsers (especially using ujson).
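
A minimal round trip for the dictionary from the question (the bar.json filename is illustrative; the ujson fallback is optional):

import json

try:
    import ujson as fast_json  # optional drop-in speed-up for loading
except ImportError:
    fast_json = json

foo = {'key1': True, 'key2': True}

with open('bar.json', 'w') as fp:   # text mode: json.dump writes str
    json.dump(foo, fp)

with open('bar.json') as fp:
    foo = fast_json.load(fp)
print(foo['key2'])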

answered Nov 23 '22 by scnerd