What is the best way to compress JSON for storage in a memory-based store like Redis or Memcached?

Requirement: Python objects with 2-3 levels of nesting, containing basic datatypes like integers, strings, lists, and dicts (no dates etc.), need to be stored as JSON in Redis against a key. What are the best methods available for compressing JSON as a string for a low memory footprint? The target objects are not very large, having 1000 small elements on average, or about 15000 characters when converted to JSON.

e.g.

>>> my_dict
{'details': {'1': {'age': 13, 'name': 'dhruv'}, '2': {'age': 15, 'name': 'Matt'}}, 'members': ['1', '2']}
>>> json.dumps(my_dict)
'{"details": {"1": {"age": 13, "name": "dhruv"}, "2": {"age": 15, "name": "Matt"}}, "members": ["1", "2"]}'
### SOME BASIC COMPACTION ###
>>> json.dumps(my_dict, separators=(',',':'))
'{"details":{"1":{"age":13,"name":"dhruv"},"2":{"age":15,"name":"Matt"}},"members":["1","2"]}'

1/ Are there any other, better ways to compress JSON to save memory in Redis (while also ensuring lightweight decoding afterwards)?

2/ How good a candidate would msgpack (http://msgpack.org/) be?

3/ Should I consider options like pickle as well?
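
For reference, here is a minimal sketch for weighing these candidates against each other on a sample object (it assumes the third-party msgpack package is installed; zlib and pickle are in the standard library):

import json
import pickle
import zlib

import msgpack  # third-party: pip install msgpack

my_dict = {'details': {'1': {'age': 13, 'name': 'dhruv'},
                       '2': {'age': 15, 'name': 'Matt'}},
           'members': ['1', '2']}

compact_json = json.dumps(my_dict, separators=(',', ':')).encode('utf-8')
candidates = {
    'compact json': compact_json,
    'json + zlib': zlib.compress(compact_json),
    'msgpack': msgpack.packb(my_dict),
    'pickle': pickle.dumps(my_dict, protocol=pickle.HIGHEST_PROTOCOL),
}
# print each encoding's size, smallest first
for name, blob in sorted(candidates.items(), key=lambda kv: len(kv[1])):
    print('%-14s %4d bytes' % (name, len(blob)))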

asked Mar 20 '13 by DhruvPathak


2 Answers

We just use gzip as a compressor.

import gzip
import cStringIO

def decompressStringToFile(value, outputFile):
  """
  decompress the given string value (which must be valid compressed gzip
  data) and write the result to the given open file. Note that the
  output file is closed before returning.
  """
  stream = cStringIO.StringIO(value)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      outputFile.close()
      return 
    outputFile.write(chunk)

def compressFileToString(inputFile):
  """
  read the given open file, compress the data and return it as string.
  """
  stream = cStringIO.StringIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = inputFile.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

In our use case we store the result as files, as you can imagine. To work with in-memory strings only, you can use a cStringIO.StringIO() object as a replacement for the file as well, as sketched below.
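
One caveat for the purely in-memory round trip: decompressStringToFile() closes the file it writes into, and a closed StringIO discards its buffer. A sketch (Python 2, to match the answer's use of cStringIO) that works around this with a no-op close():

import StringIO  # the pure-Python module, so close() can be overridden

class KeepOpenStringIO(StringIO.StringIO):
  # decompressStringToFile() closes its output file, and a closed
  # StringIO discards its buffer; a no-op close() keeps it readable.
  def close(self):
    pass

payload = '{"details":{"1":{"age":13,"name":"dhruv"}}}'

# compress: any file-like object with read() works as the input
compressed = compressFileToString(cStringIO.StringIO(payload))

# decompress back into an in-memory "file" and read the result
out = KeepOpenStringIO()
decompressStringToFile(compressed, out)
assert out.getvalue() == payload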

answered Oct 17 '22 by Alfe


Based on @Alfe's answer above, here is a version that keeps the contents in memory (for network I/O tasks). I also made a few changes to support Python 3.

import gzip
from io import BytesIO

def decompressBytesToString(inputBytes):
  """
  decompress the given byte array (which must be valid 
  compressed gzip data) and return the decoded text (utf-8).
  """
  bio = BytesIO()
  stream = BytesIO(inputBytes)
  decompressor = gzip.GzipFile(fileobj=stream, mode='r')
  while True:  # until EOF
    chunk = decompressor.read(8192)
    if not chunk:
      decompressor.close()
      bio.seek(0)
      return bio.read().decode("utf-8")
    bio.write(chunk)

def compressStringToBytes(inputString):
  """
  read the given string, encode it in utf-8,
  compress the data and return it as a byte array.
  """
  bio = BytesIO()
  bio.write(inputString.encode("utf-8"))
  bio.seek(0)
  stream = BytesIO()
  compressor = gzip.GzipFile(fileobj=stream, mode='w')
  while True:  # until EOF
    chunk = bio.read(8192)
    if not chunk:  # EOF?
      compressor.close()
      return stream.getvalue()
    compressor.write(chunk)

To test the compression, try:

inputString="asdf" * 1000
len(inputString)
len(compressStringToBytes(inputString))
decompressBytesToString(compressStringToBytes(inputString))
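
To tie this back to the original question, the compressed bytes can go straight into Redis (a sketch assuming the third-party redis-py client and a server on localhost):

import json

import redis  # third-party client: pip install redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

my_dict = {'details': {'1': {'age': 13, 'name': 'dhruv'}},
           'members': ['1']}

# store gzip-compressed compact JSON against a key
compact = json.dumps(my_dict, separators=(',', ':'))
r.set('my_key', compressStringToBytes(compact))

# read it back: GET returns bytes, which decompress to the JSON text
restored = json.loads(decompressBytesToString(r.get('my_key')))
assert restored == my_dict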
answered Oct 17 '22 by AlaskaJoslin