 

Compressing A Series of JSON Objects While Maintaining Serial Reading?

I have a bunch of JSON objects that I need to compress because they're eating too much disk space: approximately 20 GB for a few million of them.

Ideally what I'd like to do is compress each one individually and then, when I need to read them, iteratively load and decompress each one. I tried creating a text file with each line being a zlib-compressed JSON object, but reading it back fails with a decompress error due to a truncated stream, which I believe is because the compressed strings contain newlines.
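For illustration, a minimal sketch of the failing approach (the object list and filename are placeholders): zlib.compress() returns raw binary that can itself contain newline bytes, so splitting the file on newlines can hand zlib.decompress() a truncated stream.

import json
import zlib

# Sketch of the failing "one compressed object per line" approach.
# zlib output is raw binary and may contain 0x0A bytes, so the file
# cannot be split safely on newlines when reading it back.
objects = [{"id": i, "text": "some payload " * 20} for i in range(1000)]

with open('objects.zlib', 'wb') as f:
    for obj in objects:
        blob = zlib.compress(json.dumps(obj).encode('utf-8'))
        if b'\n' in blob:
            print('blob contains a newline byte; line-based reading would truncate it')
        f.write(blob + b'\n')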

Anyone know of a good method to do this?

asked Dec 08 '13 by Newmu



1 Answer

Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.

The object takes care of compression transparently, and will buffer reads, decompressing chunks as needed.

import gzip
import json

# writing: GzipFile works with bytes, so encode each JSON line
with gzip.GzipFile(jsonfilename, 'w') as outfile:
    for obj in objects:
        outfile.write((json.dumps(obj) + '\n').encode('utf-8'))

# reading: iterating yields one line (as bytes) per JSON object
with gzip.GzipFile(jsonfilename, 'r') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj

This has the added advantage that the compression algorithm can exploit repetition across objects, improving the compression ratio.
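On Python 3 the same idea can also be written with gzip.open() in text mode, which handles the str/bytes conversion for you. A small variant (the filename is a placeholder):

import gzip
import json

# Python 3 text-mode variant: gzip.open(..., 'wt'/'rt') does the
# encoding/decoding, so you write and read plain JSON lines.
with gzip.open('objects.jsonl.gz', 'wt', encoding='utf-8') as outfile:
    for obj in objects:
        outfile.write(json.dumps(obj) + '\n')

with gzip.open('objects.jsonl.gz', 'rt', encoding='utf-8') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj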

answered Oct 16 '22 by Martijn Pieters