Large-scale data processing: HBase vs Cassandra [closed]

I have nearly settled on Cassandra after researching large-scale data storage solutions, but it is generally said that HBase is the better solution for large-scale data processing and analysis.

Both are key/value stores, and both can run under a Hadoop layer (Cassandra only recently), so what makes HBase the better candidate when processing and analysis are required on large data sets?

I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

but I'm still looking for concrete advantages of HBase.

I am more convinced by Cassandra because of its simplicity when adding nodes, its seamless replication, and its lack of a single point of failure. It also supports secondary indexes, which is a good plus.

Gary Lindahl asked Aug 29 '11 23:08



1 Answer

As a Cassandra developer, I'm better at answering the other side of the question:

  • Cassandra scales better. Cassandra is known to scale to over 400 nodes in a cluster; when Facebook deployed Messaging on top of HBase they had to shard it across 100-node HBase sub-clusters.
  • Cassandra supports hundreds, even thousands of ColumnFamilies. "HBase currently does not do well with anything above two or three column families."
  • As a fully distributed system with no "special" nodes or processes, Cassandra is simpler to set up and operate, easier to troubleshoot, and more robust.
  • Cassandra's support for multi-master replication means that not only do you get the obvious power of multiple datacenters -- geographic redundancy, local latencies -- but you can also split realtime and analytical workloads into separate groups, with realtime, bidirectional replication between them. If you don't split those workloads apart they will contend spectacularly.
  • Because each Cassandra node manages its own local storage, Cassandra has a substantial performance advantage that is unlikely to be narrowed significantly. (E.g., it's standard practice to put the Cassandra commitlog on a separate device so it can do its sequential writes unimpeded by random i/o from read requests.)
  • Cassandra lets you choose, on a per-operation basis, how strong a consistency guarantee to require. This is sometimes misunderstood as "Cassandra does not give you strong consistency," but that is incorrect.
  • Cassandra offers RandomPartitioner as well as the more Bigtable-like OrderedPartitioner. RandomPartitioner is much less prone to hot spots.
  • Cassandra offers on- or off-heap caching with performance comparable to memcached, but without the cache consistency problems or the complexity of requiring extra moving parts.
  • Non-Java clients are not second-class citizens.
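
Two of the points above, the partitioner choice and per-operation consistency, can be illustrated with a small Python sketch. This is not Cassandra code: the node count, split points, and function names are hypothetical, token ranges are assumed to be split evenly across nodes, and the consistency check is only the usual replica-overlap rule (a read sees the latest write when the replicas read plus the replicas written exceed the replication factor).

```python
import hashlib
from bisect import bisect_left
from collections import Counter

NODES = 4  # hypothetical cluster size


def random_partition(key, nodes=NODES):
    """Hash the key with MD5 (the hash RandomPartitioner is based on) and
    map the resulting token onto evenly sized node ranges (a simplification)."""
    token = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    return token * nodes // 2**128


def ordered_partition(key, split_points):
    """Place the key by sort order; split_points are the boundaries
    between consecutive nodes' key ranges (hypothetical values)."""
    return bisect_left(split_points, key)


def is_strongly_consistent(replication_factor, write_replicas, read_replicas):
    """Replica-overlap rule: reads and writes intersect when R + W > N."""
    return write_replicas + read_replicas > replication_factor


# Timestamp-like keys, a common source of hot spots with ordered placement.
keys = [f"2011-08-29T23:{m:02d}:{s:02d}" for m in range(60) for s in range(60)]

ordered_load = Counter(ordered_partition(k, ["g", "n", "t"]) for k in keys)
random_load = Counter(random_partition(k) for k in keys)

# Every key sorts before "g", so the ordered layout sends all 3600 to node 0.
print("ordered:", dict(ordered_load))
# The hashed layout spreads the same keys over all nodes.
print("random: ", dict(sorted(random_load.items())))

# With a replication factor of 3, quorum writes plus quorum reads (2 + 2) overlap;
# writing and reading a single replica each (1 + 1) does not.
print(is_strongly_consistent(3, 2, 2), is_strongly_consistent(3, 1, 1))
```

The trade-off is that hashed placement gives up efficient range scans over row keys, which is what the ordered layout is for.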

To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.

There are also some things that are not necessarily better, or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning, and adds SuperColumns instead.
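
The data-model difference can be sketched in Python as two toy in-memory stores. This is only an illustration under stated assumptions: the class names and the `max_versions` parameter are hypothetical, neither class reflects the real HBase or Cassandra API, and SuperColumns are not modeled.

```python
from collections import defaultdict


class VersionedStore:
    """Bigtable-style cell: each (row, column) keeps several timestamped versions."""

    def __init__(self, max_versions=3):  # hypothetical retention limit
        self.max_versions = max_versions
        self.cells = defaultdict(list)  # (row, column) -> [(timestamp, value)], newest first

    def put(self, row, column, timestamp, value):
        versions = self.cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest versions beyond the limit

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest value at or before `timestamp`."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


class LatestOnlyStore:
    """Unversioned cell: the timestamp only decides which write wins."""

    def __init__(self):
        self.cells = {}  # (row, column) -> (timestamp, value)

    def put(self, row, column, timestamp, value):
        current = self.cells.get((row, column))
        # Ties go to the later call here; a real system needs a deterministic tie-break.
        if current is None or timestamp >= current[0]:
            self.cells[(row, column)] = (timestamp, value)

    def get(self, row, column):
        current = self.cells.get((row, column))
        return None if current is None else current[1]


versioned, latest = VersionedStore(), LatestOnlyStore()
for store in (versioned, latest):
    store.put("user1", "email", 1, "old@example.com")
    store.put("user1", "email", 2, "new@example.com")

print(versioned.get("user1", "email"))               # newest version
print(versioned.get("user1", "email", timestamp=1))  # older version is still readable
print(latest.get("user1", "email"))                  # only the winning write remains
```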

Hope that helps!

jbellis answered Sep 19 '22 08:09