After researching large-scale data storage solutions, I have nearly settled on Cassandra. But it's generally said that HBase is the better solution for large-scale data processing and analysis.
Both are key/value stores, and both can run under a Hadoop layer (Cassandra only recently), so what makes HBase the better candidate when processing and analysis are required on large data sets?
I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
but I'm still looking for concrete advantages of HBase.
I am more convinced by Cassandra because of its simplicity in adding nodes, its seamless replication, and its lack of a single point of failure. It also has a secondary index feature, which is a good plus.
HBase vs Cassandra - Read Performance: HBase writes each piece of data to only one server, whereas Cassandra writes to multiple servers that may hold different versions. Consequently, an HBase read does not have to reconcile versions across servers the way a Cassandra read can. HBase stores its data in HDFS and uses bloom filters and block caches for faster read performance.
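To illustrate why a bloom filter helps reads, here is a minimal sketch in Python. It is not HBase's implementation; the class name, the bit-array size, and the SHA-256 based hashing are all assumptions chosen for readability. The idea is that each store file keeps a small filter, and a read can skip any file whose filter says the row key is definitely absent.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter (hypothetical, for illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per (hypothetical) store file, filled with the row keys it holds.
store_file_filter = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    store_file_filter.add(row_key)

# A read only needs to open the file when the filter answers True.
should_open_file = store_file_filter.might_contain("user:2")
```

A filter never gives a false negative, so skipping a file on `False` is always safe; a `True` answer can occasionally be a false positive, which only costs one unnecessary file read.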
Cassandra's and HBase's on-server write paths are very much alike. There are only slight differences: the names of the data structures, and the fact that, unlike Cassandra, HBase doesn't write to the log and the cache simultaneously, which makes its writes slower.
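The shared shape of that write path can be sketched as follows. This is a toy model in Python, not code from either project: `TinyLogStructuredStore`, `flush_threshold`, and the in-memory lists standing in for on-disk files are all assumptions. Cassandra calls the three pieces the commit log, memtable, and SSTable; HBase calls them the write-ahead log, MemStore, and HFile. The sketch writes the log first and the cache second, and does not model concurrency or replication.

```python
class TinyLogStructuredStore:
    """Toy write path: append to a log, update an in-memory table, flush when full."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []      # stand-in for the on-disk append-only log
        self.memtable = {}        # in-memory cache of recent writes
        self.flushed_files = []   # each entry is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # 1. Record the write in the log so it survives a crash.
        self.commit_log.append((key, value))
        # 2. Apply the write to the in-memory table.
        self.memtable[key] = value
        # 3. When the in-memory table is full, write it out as a sorted file.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        self.flushed_files.append(sorted(self.memtable.items()))
        self.memtable = {}
        # Log entries covered by the flushed file are no longer needed.
        self.commit_log = []

    def get(self, key):
        # Newest data wins: check memory first, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sorted_file in reversed(self.flushed_files):
            for stored_key, stored_value in sorted_file:
                if stored_key == key:
                    return stored_value
        return None


store = TinyLogStructuredStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)   # reaches the threshold and triggers a flush
store.put("a", 3)   # a newer value for "a" stays in memory
latest_a = store.get("a")
```

Because writes only append to a log and update memory, they avoid random disk seeks; the cost is paid on reads, which may have to consult several flushed files.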
Apache HBase is a NoSQL key/value store that runs on top of HDFS. Unlike Hive, HBase operations run in real time against its database rather than as MapReduce jobs. HBase data is organized into tables, and tables are further split into column families.
Yes: Cassandra is very fast at writing bulk data in sequence and reading it sequentially. HBase is very good at random I/O on data stored in HDFS.
As a Cassandra developer, I'm better at answering the other side of the question:
To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.
There are also some things that are not necessarily better, or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning, and adds SuperColumns instead.
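As a rough illustration of that difference, the sketch below models a Bigtable-style cell that keeps several timestamped versions, next to a SuperColumn-style layout that nests one extra level of columns instead. It is plain Python with hypothetical names (`VersionedTable`, `max_versions`, the sample row keys) and does not use either system's client API.

```python
class VersionedTable:
    """Toy Bigtable-style table: every cell keeps a value per timestamp."""

    def __init__(self):
        # row key -> column name -> {timestamp: value}
        self.rows = {}

    def put(self, row, column, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, max_versions=1):
        versions = self.rows.get(row, {}).get(column, {})
        # Return the newest versions first, as (timestamp, value) pairs.
        return sorted(versions.items(), reverse=True)[:max_versions]


table = VersionedTable()
table.put("user:1", "info:email", 1, "old@example.com")
table.put("user:1", "info:email", 2, "new@example.com")
newest = table.get("user:1", "info:email")
history = table.get("user:1", "info:email", max_versions=2)

# SuperColumn-style layout: no version history, but one more level of nesting.
# row key -> super column -> column -> value
super_column_row = {
    "user:1": {
        "address": {"city": "Paris", "zip": "75001"},
        "contact": {"email": "new@example.com"},
    }
}
city = super_column_row["user:1"]["address"]["city"]
```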
Hope that helps!