I am working on a project that stores key/value information about users in HBase. We are in the process of redesigning the HBase schema we are using. The two options being discussed are:

1. Serialize all of a user's key/value pairs into a single value (e.g. with Avro or Thrift) stored in one column.
2. Store each key as its own column qualifier, with its value in the corresponding cell.
What are the design tradeoffs of the two approaches? Is one preferable to the other? Are there any reasons not to store the data using Avro or Thrift?
Apache Avro™ is the leading serialization format for record data, and first choice for streaming data pipelines. It offers excellent schema evolution, and has implementations for the JVM (Java, Kotlin, Scala, …), Python, C/C++/C#, PHP, Ruby, Rust, JavaScript, and even Perl.
Serialization is the process of translating data structures or an object's state into binary or textual form, either to transport the data over a network or to store it in persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again.
When we need to store a large set of data on disk, we use Avro, since it helps conserve space. Moreover, we get better remote data transfer throughput using Avro for RPC, since Avro produces smaller binary output than Java serialization.
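To make the size difference concrete, here is a minimal sketch (not actual Avro; the record, field names, and values are made up for illustration) comparing a plain-text encoding of a small record with a compact binary encoding of the same fields. Like an Avro binary record, the binary form carries no field names on the wire; a real Avro reader would recover them from the schema.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SizeDemo {
    public static void main(String[] args) throws IOException {
        long userId = 123456789L;
        int age = 34;
        boolean active = true;

        // Text form: roughly what a JSON-like encoding would put on the wire,
        // field names included.
        String text = "{\"userId\":123456789,\"age\":34,\"active\":true}";
        int textSize = text.getBytes(StandardCharsets.UTF_8).length;

        // Binary form: fixed-width fields, no field names on the wire.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(userId);    // 8 bytes
        out.writeInt(age);        // 4 bytes
        out.writeBoolean(active); // 1 byte
        out.flush();
        int binarySize = buf.size();

        System.out.println("text bytes:   " + textSize);
        System.out.println("binary bytes: " + binarySize);
    }
}
```

For this toy record the binary form is 13 bytes against 43 for the text form; real Avro does even better on repeated records because the schema is stored once, not per row.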
In summary, I lean towards using distinct columns per key, for the following reasons:
1) You are requiring clients to use Avro/Thrift, which is another dependency. This dependency may rule out certain tooling, such as BI tools that expect to find values in the data without transformation.
2) Under the Avro/Thrift scheme, you are pretty much forced to bring the entire value across the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in the 'city' field/column qualifier, you still have to fetch 'payments', 'credit-card-info', etc. This may also pose a security issue.
3) Updates, if required, will be more challenging with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. With Avro/Thrift, you are forced to read the entire value, deserialize it, add the field, reserialize it, and write the whole value back. Under the column scheme, a new cell is appended containing only the new column. For a single row this is not a big deal, but if you do it to a billion rows, a large compaction operation will follow.
4) If configured, compression in HBase may beat the Avro/Thrift serialization in size, since it can compress across a column family instead of just within a single record.
5) BigTable-style implementations like HBase handle very wide, sparse tables well, so there won't be the performance hit you might expect from having many columns.
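The tradeoffs in points 2) and 3) can be sketched with a toy model (plain Java maps, not the real HBase client API; all field names and the `key=value;` blob format here are invented for illustration). The column scheme lets you read or add a single qualifier; the blob scheme forces a full decode/re-encode cycle for the same operations.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaTradeoff {
    public static void main(String[] args) {
        // Column scheme: one cell per key. Reading 'city' touches only 'city',
        // and adding a field is a single new cell.
        Map<String, String> columns = new HashMap<>();
        columns.put("city", "Austin");
        columns.put("payments", "...");
        String city = columns.get("city");   // fetch one qualifier only
        columns.put("hasIphone6", "true");   // append one new cell

        // Blob scheme: the whole record is one serialized value. To read
        // 'city' or add a field, the entire blob must be decoded, modified,
        // re-encoded, and written back.
        String blob = "city=Austin;payments=...";
        Map<String, String> decoded = new HashMap<>();
        for (String pair : blob.split(";")) {
            String[] kv = pair.split("=", 2);
            decoded.put(kv[0], kv[1]);
        }
        decoded.put("hasIphone6", "true");   // read-modify-write of everything

        System.out.println("column read: " + city);
        System.out.println("blob fields after update: " + decoded.size());
    }
}
```

In real HBase, the column-scheme read corresponds to a `Get` restricted to one column qualifier, while the blob scheme always returns (and rewrites) the whole serialized value.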