
Is there a faster way to store a big dictionary, than pickle or regular Python file? [closed]

Tags:

python

pickle

I want to store a dictionary which only contains data in the following format:

{
    "key1" : True,
    "key2" : True,
    .....
}

In other words, just a quick way to check if a key is valid or not. I can do this by storing a dict called foo in a file called bar.py, and then in my other modules, I can import it as follows:

from bar import foo

Or, I can save it in a pickle file called bar.pickle, and import it at the top of the file as follows:

import pickle  
with open('bar.pickle', 'rb') as f:
    foo = pickle.load(f)

Which would be the ideal, and faster way to do this?

asked Jan 02 '19 by darkhorse


People also ask

How do I store a large dictionary in Python?

If you want to work with a dictionary larger than memory can hold, the shelve module is a good quick-and-dirty solution. It behaves like a dict but keeps its contents on disk rather than in memory. shelve serialises its entries with pickle, so be sure to set the protocol to something higher than 0.
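
A minimal sketch of that usage (the filename 'flags.db' and the keys here are placeholders, not taken from the question):

import shelve

# writing: shelve.open returns a dict-like object backed by a file on disk
with shelve.open('flags.db', protocol=2) as db:
    db['key1'] = True
    db['key2'] = True

# reading: only the entries you actually access are loaded, so the data can exceed RAM
with shelve.open('flags.db', flag='r') as db:
    print(db.get('key1', False))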

Why pickle is not good in Python?

Pickle constructs arbitrary Python objects by invoking arbitrary functions, which is why it is not secure. That same property, however, lets it serialise almost any Python object, including objects that JSON and other serialisation formats cannot represent, and unpickling usually requires no boilerplate.
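
A small illustration (not from the question) of that difference in what can be serialised:

import json
import pickle

data = {'ids': {1, 2, 3}}          # a set is not a JSON type

blob = pickle.dumps(data)          # pickle handles it fine
print(pickle.loads(blob) == data)  # True

try:
    json.dumps(data)               # TypeError: set is not JSON serializable
except TypeError as exc:
    print(exc)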

Which is faster JSON or pickle?

JSON is a lightweight text format and is fast to parse, although for simple data pickle can be just as quick (see the benchmarks in the answers below). There is always a security risk with pickle: unpickling data from an unknown source should be avoided, because it may execute malicious code. Loading JSON never executes code, so it does not carry that class of risk.

Is Python pickling slow?

Pickle, on the other hand, is comparatively slow, insecure, and can only be parsed by Python. Its only real advantage is that it can serialise arbitrary Python objects, whereas both JSON and MessagePack limit the types of data they can write out.

Why is a dictionary faster than a list in Python?

Key lookups in a dictionary are faster than searching a list because Python implements dictionaries as hash tables: a lookup hashes the key and jumps almost directly to the matching entry instead of scanning every element. Dictionaries are Python's built-in mapping type and have been highly optimized.
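
A rough way to see this for yourself (a benchmark sketch; the sizes and numbers are arbitrary and will vary by machine):

import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list, True)

# membership test for a worst-case element (the last one)
print(timeit.timeit(lambda: (n - 1) in as_list, number=1_000))  # linear scan
print(timeit.timeit(lambda: (n - 1) in as_dict, number=1_000))  # hash lookup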

Why do Python dictionaries take up so much memory?

Because dictionaries are Python's built-in mapping type, they are highly optimized. However, there is a classic space-time trade-off between dictionaries and lists: a dictionary reduces lookup time, but its hash table needs more memory.

How much faster is a dictionary lookup than a list lookup?

With 10,000,000 items, a dictionary lookup can be around 585,714 times faster than a list lookup. Numbers such as 6.6 or 585,714 are simply the results of a quick test on one machine and will vary.


2 Answers

To add to @scnerd's comment, here are the timings in IPython for different load situations.

Here we create a dictionary and write it to 3 formats:

import random
import json
import pickle

letters = 'abcdefghijklmnopqrstuvwxyz'
d = {''.join(random.choices(letters, k=6)): random.choice([True, False]) 
     for _ in range(100000)}

# write a python file
with open('mydict.py', 'w') as fp:
    fp.write('d = {\n')
    for k,v in d.items():
        fp.write(f"'{k}':{v},\n")
    fp.write('None:False}')  # dummy final entry absorbs the trailing comma and closes the dict

# write a pickle file
with open('mydict.pickle', 'wb') as fp:
    pickle.dump(d, fp)

# write a json file (text mode: json.dump writes str, not bytes)
with open('mydict.json', 'w') as fp:
    json.dump(d, fp)

Python file:

# the first import has to parse mydict.py (and write __pycache__)
%%timeit -n1 -r1
from mydict import d

644 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# within the same session, a re-import just hits sys.modules, so it is MUCH faster
%%timeit
from mydict import d

1.37 µs ± 54.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

pickle file:

%%timeit
with open('mydict.pickle', 'rb') as fp:
    pickle.load(fp)

52.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

json file:

%%timeit
with open('mydict.json', 'rb') as fp:
    json.load(fp)

81.3 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# here is the same test with ujson
import ujson

%%timeit
with open('mydict.json', 'rb') as fp:
    ujson.load(fp)

51.2 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Nov 23 '22 by James


Python File

Using a Python file means the dictionary gets cached automatically: if you "import" it multiple times, it only has to be parsed once. However, Python syntax is complicated, so the parser that loads the file is not specialized for the limited complexity of the data you're saving (unless you're also including arbitrary Python objects and code). It's easy to view, edit, and use, but not easy to move between programs or languages.
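
For example (assuming the bar.py/foo names from the question), the caching looks roughly like this:

import importlib
import sys

from bar import foo           # first import: bar.py is parsed and byte-compiled

print('bar' in sys.modules)   # True: the parsed module is cached for this session
from bar import foo           # re-import: no parsing, just a sys.modules lookup

import bar
importlib.reload(bar)         # only an explicit reload re-parses bar.py
foo = bar.foo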

EDIT: to clarify, raw Python files are easy for a human to modify, but very hard for a program to modify. If your code changes the data at runtime and you want those changes written back to the file, you're pretty much out of luck: use one of the formats below instead.

Pickle File

If you use a pickle file, you either re-load the file every time you use it or write some management code to cache it after the first read. Like arbitrary Python code, pickle files can be quite complex, and the loader is not specialized for your particular data types since, like raw Python files, they can store almost arbitrary Python objects. They are hard for a human to view or edit, you may run into portability issues if you move the data between machines or Python versions, and only Python can read them. You also need to keep pickle's security implications in mind: loading a pickle file can execute arbitrary code, so only unpickle files you trust.
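
One way to get that caching (a sketch: the load_flags helper is made up here, and bar.pickle is the filename from the question) is a loader wrapped in functools.lru_cache:

import pickle
from functools import lru_cache

@lru_cache(maxsize=1)
def load_flags(path='bar.pickle'):
    # the disk read happens only on the first call;
    # later calls return the same cached dictionary object
    with open(path, 'rb') as fp:
        return pickle.load(fp)

foo = load_flags()
print(foo.get('key1', False))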

JSON File

If all you're storing is simple objects (dictionaries, lists, strings, booleans, numbers), consider using the JSON file format. Python has a built-in json module that's just as easy to use as pickle, so there's no added complexity. JSON files are easy to store, view, edit, and compress (if desired), and they look almost exactly like a Python dictionary. JSON is also highly portable (most common languages can read and write it these days), and if you need faster loading, the ujson module is a faster, drop-in replacement for the standard json module. Since the JSON format is fairly restricted, I'd expect its parsers and writers to be quite a bit faster than the regular Python or pickle parsers (especially using ujson).
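
A minimal round trip for the dictionary from the question (the bar.json filename is illustrative; the ujson fallback is optional):

import json

try:
    import ujson as fast_json  # optional drop-in speed-up for loading
except ImportError:
    fast_json = json

foo = {'key1': True, 'key2': True}

with open('bar.json', 'w') as fp:   # text mode: json.dump writes str
    json.dump(foo, fp)

with open('bar.json') as fp:
    foo = fast_json.load(fp)
print(foo['key2'])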

answered Nov 23 '22 by scnerd