What's the most space-efficient way to compress serialized Python data?

From the Python documentation:

By default, the pickle data format uses a relatively compact binary representation. If you need optimal size characteristics, you can efficiently compress pickled data.

I'm going to be serializing several gigabytes of data at the end of a process that runs for several hours, and I'd like the result to be as small as possible on disk. However, Python offers several different ways to compress data.

Is there one of these that's particularly efficient for pickled files? The data I'm pickling mostly consists of nested dictionaries and strings, so if there's a more efficient way to compress e.g. JSON, that would work too.

The time for compression and decompression isn't important, but the time this process takes to generate the data makes trial and error inconvenient.

505

asked Sep 18 '19 00:09

Draconis

2 Answers

I ran some tests using a pickled object, and lzma gave the best compression.

Your results may vary depending on your data, so I'd recommend testing these with some sample data of your own.

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----        9/17/2019  10:05 PM       23869925 no_compression.pickle
-a----        9/17/2019  10:06 PM        6050027 gzip_test.gz
-a----        9/17/2019  10:06 PM        3083128 bz2_test.pbz2
-a----        9/17/2019  10:07 PM        1295013 brotli_test.bt
-a----        9/17/2019  10:06 PM        1077136 lmza_test.xz

Test file used (you'll need to pip install brotli or remove that algorithm):

import bz2
import gzip
import lzma
import pickle

import brotli


class SomeObject():

    a = 'some data'
    b = 123
    c = 'more data'

    def __init__(self, i):
        self.i = i


data = [SomeObject(i) for i in range(1, 1000000)]

with open('no_compression.pickle', 'wb') as f:
    pickle.dump(data, f)

with gzip.open("gzip_test.gz", "wb") as f:
    pickle.dump(data, f)

with bz2.BZ2File('bz2_test.pbz2', 'wb') as f:
    pickle.dump(data, f)

with lzma.open("lmza_test.xz", "wb") as f:
    pickle.dump(data, f)

# brotli.compress takes bytes, so reuse the uncompressed pickle written above
with open('no_compression.pickle', 'rb') as f:
    pdata = f.read()

with open('brotli_test.bt', 'wb') as b:
    b.write(brotli.compress(pdata))
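
Since the real data takes hours to generate, it may be easier to compare compressors in memory on a small sample first. The following is only a minimal sketch using the standard-library modules: `compressed_sizes` and `sample` are hypothetical names, the sample dictionary is made up to resemble nested dictionaries and strings, and the maximum compression settings are chosen on the assumption that compression time does not matter.

```python
import bz2
import gzip
import lzma
import pickle


def compressed_sizes(obj):
    """Return {codec name: size in bytes} for one pickled object."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw, compresslevel=9)),
        "bz2": len(bz2.compress(raw, compresslevel=9)),
        # PRESET_EXTREME trades extra CPU time for a slightly smaller result
        "lzma": len(lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)),
    }


# Hypothetical sample data: nested dictionaries and strings
sample = {
    f"key{i}": {"name": f"item {i}", "tags": ["a", "b", "c"]}
    for i in range(5000)
}

# Print the codecs from smallest to largest output
for name, size in sorted(compressed_sizes(sample).items(), key=lambda kv: kv[1]):
    print(f"{name:>5}: {size:>10,d} bytes")
```

Swapping `sample` for a slice of the real data should give a rough idea of the ranking before committing to a multi-hour run, though ratios on a small sample will not necessarily match those on the full dataset.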

145

answered Oct 27 '22 04:10

Gabriel Cappelli

Just adding an alternative that easily provided me with the highest compression ratio and on top of that did it so fast I was sure I made a mistake somewhere (I didn't). The real bonus is that the decompression is also very fast, so any program that reads in lots of preprocessed data, for example, will benefit hugely from this. One potential caveat is that there is mention of "small arrays (<2GB)" somewhere here, but it looks like there are ways around that. Or, if you're lazy like me, breaking up your data instead is usually an option.

Some smart cookies came up with python-blosc. It's a "high performance compressor", according to their docs. I was led to it by an answer to this question.

Once it's installed, e.g. via pip install blosc or conda install python-blosc, you can compress pickled data pretty easily as follows:

import blosc
import numpy as np
import pickle

data = np.random.rand(3, 3, int(1e7))  # dimensions must be ints; 1e7 alone is a float

pickled_data = pickle.dumps(data)  # returns data as a bytes object
compressed_pickle = blosc.compress(pickled_data)

with open("path/to/file/test.dat", "wb") as f:
    f.write(compressed_pickle)

And to read it:

with open("path/to/file/test.dat", "rb") as f:
    compressed_pickle = f.read()

depressed_pickle = blosc.decompress(compressed_pickle)
data = pickle.loads(depressed_pickle)  # turn bytes object back into data

I'm using Python 3.7. Without even comparing the different compression options, I got a compression ratio of about 12, and reading, decompressing and loading the compressed pickle file took only a fraction of a second longer than loading the uncompressed pickle file.
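
For the size caveat mentioned above, one way to break the data up is to compress the pickled bytes in chunks and store each chunk with a length prefix. This is only a sketch under stated assumptions: `dump_chunked`, `load_chunked` and the 8-byte length-prefix framing are hypothetical, not part of blosc, and `zlib` stands in as the default compressor so the example runs without extra packages. Passing `blosc.compress` and `blosc.decompress` instead should work since both take a single bytes argument, but that is not tested here.

```python
import os
import pickle
import struct
import tempfile
import zlib


def dump_chunked(obj, path, compress=zlib.compress, chunk_size=2**30):
    """Pickle obj and compress it in pieces of at most chunk_size bytes."""
    raw = memoryview(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    with open(path, "wb") as f:
        for start in range(0, len(raw), chunk_size):
            block = compress(bytes(raw[start:start + chunk_size]))
            f.write(struct.pack("<Q", len(block)))  # 8-byte little-endian length
            f.write(block)


def load_chunked(path, decompress=zlib.decompress):
    """Read the length-prefixed chunks back and unpickle the joined bytes."""
    parts = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:  # end of file
                break
            (size,) = struct.unpack("<Q", header)
            parts.append(decompress(f.read(size)))
    return pickle.loads(b"".join(parts))


# Round trip with a tiny chunk size, only to exercise the multi-chunk path
data = {"words": ["spam", "eggs"] * 1000, "nested": {"a": 1}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunked.dat")
    dump_chunked(data, path, chunk_size=1024)
    restored = load_chunked(path)

print(restored == data)
```

Compressing chunks independently usually costs a little in ratio compared with compressing everything at once, so the chunk size is best kept as large as the compressor allows.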

I wrote this more as a reference for myself, but I hope someone else will find this useful.

Peace oot

answered Oct 27 '22 04:10

miterhen


Tags: python, serialization, compression, pickle
