Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the pros or cons of storing json as text vs blob in cassandra?

One problem with blob for me is, in java, ByteBuffer (which is mapped to blob in cassandra) is not Serializable hence does not work well with EJBs.

Considering the json is fairly large what would be the better type for storing json in cassandra. Is it text or blob?

Does the size of the json matter when deciding the blob vs json?

If it were any other database like oracle, it's common to use blob/clob. But in Cassandra where each cell can hold as large as 2GB, does it matter?

Please consider this question as the choose between text vs blob for this case, instead of sorting to suggestions regarding whether to use single column for json.

like image 206
pinkpanther Avatar asked Jul 10 '15 11:07

pinkpanther


1 Answers

I don't think there's any benefit for storing the literal JSON data as a BLOB in Cassandra. At best your storage costs are identical, and in general the API's are less convenient in terms of working with BLOB types as they are for working with strings/text.

For instance, if you're using their Java API then in order to store the data as a BLOB using a parameterized PreparedStatement you first need to load it all into a ByteBuffer, for instance by packing your JSON data into an InputStream.

Unless you're dealing with very large JSON snippets that force you to stream your data anyways, that's a fair bit of extra work to get access to the BLOB type. And what would you gain from it? Essentially nothing.

However, I think there's some merit in asking 'Should I store JSON as text, or gzip it and store the compressed data as a BLOB?'.

And the answer to that comes down to how you've configured Cassandra and your table. In particular, as long as you're using Cassandra version 1.1 or later your tables have compression enabled by default. That may be adequate, particularly if your JSON data is fairly uniform across each row.

However, Cassandra's built-in compression is applied table-wide, rather than to individual rows. So you may get a better compression ratio by manually compressing your JSON data before storage, writing the compressed bytes into a ByteBuffer, and then shipping the data into Cassandra as a BLOB.

So it essentially comes down to a tradeoff in terms of storage space vs. programming convenience vs. CPU usage. I would decide the matter as follows:

  1. Is minimizing the amount of storage consumed your biggest concern?
    • If yes, compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #2.
  2. Is Cassandra's built-in compression available and enabled for your table?
    • If no (and if you can't enable the compression), compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #3.
  3. Is the data you'll be storing relatively uniform across each row?
    • Probably for JSON data the answer is 'yes', in which case you should store the data as text and let Cassandra handle the compression;
    • Otherwise proceed to #4.
  4. Do you want efficiency, or convenience?
    • Efficiency; compress the JSON data and store the compressed bytes as a BLOB.
    • Convenience; compress the JSON data, base64 the compressed data, and then store the base64-encoded data as text.
like image 104
aroth Avatar answered Oct 07 '22 01:10

aroth