 

Bigtable / HBase: Rich column family vs a single JSON Object

I want to store quite a large amount of data (a few petabytes) in Google Cloud Bigtable for serving purposes. I plan to access the data by its primary key, and sometimes by a key-prefix query.

No data updates are planned. Only appends to existing tables.

My question is: since I won't use any of my columns to filter, query, or sort (which isn't possible in Bigtable anyway), is there any benefit to storing my data in separate columns rather than as a single JSON document per row?
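
For concreteness, the two access patterns I have in mind look roughly like the sketch below, using the Cloud Bigtable HBase client (all project, instance, table, and column names are placeholders):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.google.cloud.bigtable.hbase.BigtableConfiguration;

    public class AccessPatterns {
      public static void main(String[] args) throws Exception {
        // Placeholder project, instance, and table names.
        try (Connection connection = BigtableConfiguration.connect("my-project", "my-instance");
             Table table = connection.getTable(TableName.valueOf("my-table"))) {

          // 1. Point lookup by row key.
          Result row = table.get(new Get(Bytes.toBytes("user#12345")));
          byte[] doc = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("doc"));
          System.out.println("doc bytes: " + (doc == null ? 0 : doc.length));

          // 2. Key-prefix query: scan every row whose key starts with a prefix.
          Scan scan = new Scan();
          scan.setRowPrefixFilter(Bytes.toBytes("user#123"));
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              // process each matching row here
            }
          }
        }
      }
    }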

Thanks!

Forepick asked Jun 19 '16 at 21:06


1 Answer

Disclosure: I lead product management for Cloud Bigtable.

If you don't plan to retrieve or update data at per-column granularity, your plan of storing the JSON document as a single value is fine. That's particularly true because storing data per column means the column family name and column qualifier must also be stored with every value, adding storage overhead that is proportional to the number of values and may therefore be meaningful at your scale. In this model, you'll simply be using Bigtable as a key-value store.
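
To make the key-value model concrete, here is a rough sketch (reusing the connection and imports from the snippet in your question, plus org.apache.hadoop.hbase.client.Put; table, family, and qualifier names are placeholders). The entire JSON document is one cell under a deliberately short column family and qualifier, so those names are stored once per row rather than once per field:

    // Key-value model: one cell per row holding the whole JSON document.
    Table table = connection.getTable(TableName.valueOf("my-table"));

    byte[] rowKey = Bytes.toBytes("user#12345");
    byte[] jsonDoc = Bytes.toBytes("{\"name\":\"Ada\",\"score\":42}");

    // Short family ("d") and qualifier ("doc") keep the per-cell overhead small.
    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("doc"), jsonDoc);
    table.put(put);

    // Read the document back by key.
    Result result = table.get(new Get(rowKey));
    byte[] storedJson = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("doc"));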

If you do decide to break your JSON apart into many columns in the future, you can add additional column families to an existing Bigtable table (or just use additional column qualifiers within your existing column family) and rewrite your data via a parallel process such as Hadoop MapReduce or Google Cloud Dataflow.
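
As a rough sketch of what that later migration could look like, assuming the HBase 2.x-style Admin API exposed by the Bigtable HBase client (names are placeholders, and in practice the rewrite itself would run as a parallel Dataflow or MapReduce job rather than a loop in one process):

    // Extra imports: org.apache.hadoop.hbase.client.Admin,
    //                org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder

    // Add a new column family to the existing table.
    try (Admin admin = connection.getAdmin()) {
      admin.addColumnFamily(
          TableName.valueOf("my-table"),
          ColumnFamilyDescriptorBuilder.of("f"));  // "f" is a placeholder family name
    }

    // Rewrite a row with one column per field instead of one JSON blob.
    Table table = connection.getTable(TableName.valueOf("my-table"));
    Put put = new Put(Bytes.toBytes("user#12345"));
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("score"), Bytes.toBytes("42"));
    table.put(put);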

Side note: JSON is verbose and takes up quite a bit of space. While you can pre-compress it yourself, Cloud Bigtable natively (and transparently) compresses data, which helps mitigate this. That said, one alternative to consider is protocol buffers or another binary encoding, which is more space-efficient.
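
As a purely hypothetical sketch of that alternative, assuming a UserRecord class generated from a .proto definition (not something from your setup), the cell value becomes the compact binary encoding instead of a JSON string:

    // Hypothetical: UserRecord is generated from a .proto such as
    //   message UserRecord { string name = 1; int32 score = 2; }
    UserRecord record = UserRecord.newBuilder()
        .setName("Ada")
        .setScore(42)
        .build();

    // Store the serialized bytes in the same single cell as before.
    Put put = new Put(Bytes.toBytes("user#12345"));
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("doc"), record.toByteArray());
    table.put(put);

    // On read, parse the bytes back into the message.
    // (parseFrom throws InvalidProtocolBufferException, which you'd handle.)
    byte[] raw = table.get(new Get(Bytes.toBytes("user#12345")))
                      .getValue(Bytes.toBytes("d"), Bytes.toBytes("doc"));
    UserRecord parsed = UserRecord.parseFrom(raw);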

Given that you plan to store multiple petabytes of data, you will likely need more than the default quota of 30 Bigtable nodes—if so, please request additional quota for your use case.

Please see the Bigtable performance page for a ballpark measure of the performance you should expect per Bigtable node, but you should benchmark your specific read/write patterns to establish your own baseline and scale accordingly.

Best of luck with your project!

Misha Brukman answered Sep 22 '22 at 15:09