 

Compressing A Series of JSON Objects While Maintaining Serial Reading?

I have a bunch of JSON objects that I need to compress because they're eating too much disk space: approximately 20 GB for a few million of them.

Ideally what I'd like to do is compress each one individually and then, when I need to read them, iteratively load and decompress each one. I tried creating a text file with each line being a zlib-compressed JSON object, but reading it back fails with a decompress error due to a truncated stream, which I believe is because the compressed strings contain newlines.
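For illustration, a minimal sketch of the failing approach (the object list and filename are placeholders): zlib.compress() returns raw binary that can itself contain newline bytes, so splitting the file on newlines can hand zlib.decompress() a truncated stream.

import json
import zlib

# Sketch of the failing "one compressed object per line" approach.
# zlib output is raw binary and may contain 0x0A bytes, so the file
# cannot be split safely on newlines when reading it back.
objects = [{"id": i, "text": "some payload " * 20} for i in range(1000)]

with open('objects.zlib', 'wb') as f:
    for obj in objects:
        blob = zlib.compress(json.dumps(obj).encode('utf-8'))
        if b'\n' in blob:
            print('blob contains a newline byte; line-based reading would truncate it')
        f.write(blob + b'\n')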

Anyone know of a good method to do this?

asked Dec 08 '13 by Newmu



1 Answer

Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.

The object takes care of compression transparently, and will buffer reads, decompressing chunks as needed.

import gzip
import json

# writing: GzipFile works with bytes, so encode each JSON line
with gzip.GzipFile(jsonfilename, 'w') as outfile:
    for obj in objects:
        outfile.write((json.dumps(obj) + '\n').encode('utf-8'))

# reading: iterating yields one line (as bytes) per JSON object
with gzip.GzipFile(jsonfilename, 'r') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj

This has the added advantage that the compression algorithm can exploit repetition across objects, improving the compression ratio.
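On Python 3 the same idea can also be written with gzip.open() in text mode, which handles the str/bytes conversion for you. A small variant (the filename is a placeholder):

import gzip
import json

# Python 3 text-mode variant: gzip.open(..., 'wt'/'rt') does the
# encoding/decoding, so you write and read plain JSON lines.
with gzip.open('objects.jsonl.gz', 'wt', encoding='utf-8') as outfile:
    for obj in objects:
        outfile.write(json.dumps(obj) + '\n')

with gzip.open('objects.jsonl.gz', 'rt', encoding='utf-8') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj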

answered Oct 16 '22 by Martijn Pieters