Is it better to use HBase columns or serialize data using Avro?

Tags:

java

hbase

I'm working on a project that stores key/value information about a user in HBase. We are in the process of redesigning the HBase schema we use. The two options being discussed are:

  1. Use HBase column qualifiers as names for the keys. This would make rows wide, but very sparse.
  2. Dump all the data into a single column and serialize it using Avro or Thrift.

What are the design tradeoffs of the two approaches? Is one preferable to the other? Are there any reasons not to store the data using Avro or Thrift?
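For concreteness, here is roughly what the two options look like with the HBase Java client; the column family 'd', the row key, and the field names are made up for illustration:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SchemaOptions {
        static final byte[] CF = Bytes.toBytes("d"); // hypothetical column family

        // Option 1: one column qualifier per key -> a wide, sparse row
        static Put putPerKeyColumns(byte[] row) {
            Put p = new Put(row);
            p.addColumn(CF, Bytes.toBytes("city"), Bytes.toBytes("Boston"));
            p.addColumn(CF, Bytes.toBytes("age"), Bytes.toBytes("42"));
            return p;
        }

        // Option 2: every key packed into one serialized blob (Avro/Thrift)
        // stored under a single column
        static Put putSerializedBlob(byte[] row, byte[] serializedRecord) {
            Put p = new Put(row);
            p.addColumn(CF, Bytes.toBytes("data"), serializedRecord);
            return p;
        }
    }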

Shawn H asked Jan 29 '13

People also ask

Does Avro support serialization?

Apache Avro™ is the leading serialization format for record data, and first choice for streaming data pipelines. It offers excellent schema evolution, and has implementations for the JVM (Java, Kotlin, Scala, …), Python, C/C++/C#, PHP, Ruby, Rust, JavaScript, and even Perl.

What is serialization in Avro?

Serialization is the process of translating data structures or object state into a binary or textual form, either to transport the data over a network or to store it on some persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again.

Why do we need Avro?

When we need to store a large set of data on disk, Avro helps conserve space. Moreover, we get better remote data transfer throughput using Avro for RPC, since Avro produces smaller binary output than Java serialization.
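A minimal round trip with the Avro Java API illustrates this; the schema and values below are only examples:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class AvroRoundTrip {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"city\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("city", "Boston");
            user.put("age", 42);

            // Serialize: compact binary output; field names are not stored, only values
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(user, enc);
            enc.flush();
            byte[] bytes = out.toByteArray();

            // Deserialize: the same schema is needed to decode the bytes
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(bytes, null);
            GenericRecord back =
                new GenericDatumReader<GenericRecord>(schema).read(null, dec);
            System.out.println(back.get("city") + ", " + back.get("age"));
        }
    }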


1 Answer

In summary, I lean towards using distinct columns per key.

1) Obviously, you are forcing clients to use Avro/Thrift, which is another dependency. That dependency may rule out certain tooling, such as BI tools that expect to find plain values in the data without any transformation.

2) Under the Avro/Thrift scheme, you are pretty much forced to bring the entire value across the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in the 'city' field/column qualifier, you still have to fetch 'payments', 'credit-card-info', etc. This may also pose a security issue.
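For example, with per-key columns the client can ask the server for a single qualifier, so only that cell crosses the wire (the family and qualifier names here are the hypothetical ones from the question):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Per-key columns: the server returns just the 'city' cell.
    // With a single serialized blob there is nothing to narrow down:
    // the whole record comes back and must be decoded client-side.
    static byte[] fetchCityOnly(Table table) throws IOException {
        Get get = new Get(Bytes.toBytes("user123"));
        get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("city"));
        Result result = table.get(get);
        return result.getValue(Bytes.toBytes("d"), Bytes.toBytes("city"));
    }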

3) Updates, if required, will be more challenging with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. With Avro/Thrift, you are forced to read the existing blob, add the field, and rewrite the entire serialized value. Under the column scheme, a new cell is appended containing only the new column. For a single row this is not a big deal, but do it across a billion rows and a big compaction operation will be needed.
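A sketch of that difference, using the same hypothetical names:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Column scheme: adding 'hasIphone6' is one new cell; no other data is touched.
    static void addFlag(Table table) throws IOException {
        Put put = new Put(Bytes.toBytes("user123"));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("hasIphone6"),
            Bytes.toBytes("true"));
        table.put(put);
    }

    // Avro/Thrift scheme: read the whole blob, decode it, evolve the schema to
    // include the new field, re-encode, and write the entire value back.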

4) You can enable compression in HBase, which may beat Avro/Thrift serialization on size, since it can compress across a whole column family instead of just a single record.
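Enabling that compression is a column-family setting; here is a sketch using the HBase 2.x admin API (the table and family names are assumptions, and SNAPPY is just one algorithm choice):

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.util.Bytes;

    // Create a table whose 'd' family compresses its store files with Snappy.
    static void createCompressedTable(Admin admin) throws IOException {
        admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("users"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
                .setCompressionType(Compression.Algorithm.SNAPPY)
                .build())
            .build());
    }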

5) BigTable implementations like HBase do very well with very wide, sparse tables, so there won't be the performance hit you might expect.

cmonkey answered Nov 02 '22