
Why HBase is a better choice than Cassandra with Hadoop?

Why is using HBase a better choice than using Cassandra with Hadoop?

Can anyone please give a detailed explanation on this?

Thanks

Niladri Biswas asked Feb 19 '13 05:02




1 Answer

I don't think either is better than the other, and it's not a matter of choosing just one. These are very different systems, each with its own strengths and weaknesses, so it really depends on your use cases. They can definitely be used to complement one another in the same infrastructure.

To explain the difference better, I'd like to borrow a picture from Cassandra: The Definitive Guide, where the authors go over the CAP theorem. Basically, any distributed system has to find a balance between consistency, availability, and partition tolerance, and you can only realistically satisfy two of these properties. From that you can see that:

  • Cassandra satisfies the Availability and Partition Tolerance properties.
  • HBase satisfies the Consistency and Partition Tolerance properties.

[Image: CAP theorem diagram]
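To make the two bullet points above more concrete, here is a deliberately simplified Python sketch of what "AP" versus "CP" means during a network partition. The `Replica`, `CPStore`, and `APStore` classes are hypothetical toys made up for this illustration. They are not how either system is implemented, and they ignore things like tunable consistency levels and repair mechanisms.

```python
class Replica:
    """A single copy of one value; `reachable` simulates a network partition."""

    def __init__(self):
        self.value = None
        self.reachable = True


class CPStore:
    """Toy CP model: one authoritative owner; requests fail if it is unreachable."""

    def __init__(self):
        self.owner = Replica()

    def write(self, value):
        if not self.owner.reachable:
            raise RuntimeError("owner unreachable: write rejected")
        self.owner.value = value

    def read(self):
        if not self.owner.reachable:
            raise RuntimeError("owner unreachable: read rejected")
        return self.owner.value


class APStore:
    """Toy AP model: any reachable replica answers, so replicas may disagree."""

    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, value):
        live = [r for r in self.replicas if r.reachable]
        if not live:
            raise RuntimeError("no replica reachable")
        for r in live:
            r.value = value

    def read_from(self, index):
        replica = self.replicas[index]
        if not replica.reachable:
            raise RuntimeError("replica unreachable")
        return replica.value


ap = APStore()
ap.write("v1")
ap.replicas[2].reachable = False  # replica 2 is cut off by a partition
ap.write("v2")                    # still accepted by the reachable replicas
ap.replicas[2].reachable = True   # partition heals, no repair has run yet
print(ap.read_from(0), ap.read_from(2))  # v2 v1 (replica 2 is stale)

cp = CPStore()
cp.write("v1")
cp.owner.reachable = False        # the only owner is cut off
try:
    cp.write("v2")
except RuntimeError as err:
    print("CP store:", err)
```

The AP toy stays available but can return a stale value, while the CP toy never returns a stale value but refuses requests when the owner cannot be reached.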

When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, whose distribution is a standard enterprise distribution for Hadoop.

But Cassandra also has more integration with Hadoop now, namely DataStax Brisk, which is gaining popularity. You can also natively stream the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example). We are no longer at the point where Cassandra was just a standalone project.

In my experience, I've found that Cassandra is awesome for random reads, and not so much for scans.

To add a little color to the picture, I've been using both at my job in the same infrastructure, and HBase serves a very different purpose than Cassandra. I've used Cassandra mostly for very fast real-time lookups, while I've used HBase more for heavy ETL batch jobs with less stringent latency requirements.
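One way to picture the lookup-versus-scan contrast is to compare hash-based placement of rows (the idea behind Cassandra's random partitioning) with sorted key ranges (the idea behind HBase regions). The Python sketch below is only a toy under stated assumptions: the function names, the four-node cluster, the key format, and the split points are all hypothetical, and real clusters involve replication, caching, and much more.

```python
import bisect
import hashlib


def hash_partition(key, num_nodes):
    """Place a key on a node by hashing it (illustrative MD5-based placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def range_partition(key, split_points):
    """Place a key on a node by sorted key range, using sorted split points."""
    return bisect.bisect_right(split_points, key)


def nodes_touched(keys, place):
    """Return the set of nodes that hold at least one of the given keys."""
    return {place(k) for k in keys}


keys = [f"user{i:04d}" for i in range(1000)]
split_points = ["user0250", "user0500", "user0750"]  # hypothetical region boundaries

# A scan over a contiguous key range of 100 rows.
scan = [k for k in keys if "user0100" <= k < "user0200"]

hashed = nodes_touched(scan, lambda k: hash_partition(k, 4))
ranged = nodes_touched(scan, lambda k: range_partition(k, split_points))
print("nodes touched by scan (hashed):", sorted(hashed))
print("nodes touched by scan (ranged):", sorted(ranged))

# A single-key lookup touches exactly one node under either scheme.
print(hash_partition("user0042", 4), range_partition("user0042", split_points))
```

With hashed placement, neighbouring keys are scattered, so a range scan tends to fan out across the cluster, while a single-key lookup goes straight to one node. With sorted ranges, the whole scan here stays inside a single range.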

This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the two systems. Bottom line: there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.

Charles Menguy answered Sep 21 '22 13:09