Why is using HBase a better choice than using Cassandra with Hadoop?
Can anyone please give a detailed explanation on this?
Thanks
HBase vs Cassandra - Read Performance: HBase writes each row to a single region server, whereas Cassandra writes to multiple replica nodes that may briefly hold different versions of the data. Consequently, HBase reads are more straightforward than Cassandra's, since there are no diverging replicas to reconcile at read time. HBase stores its data in HDFS and uses Bloom filters and a block cache for faster read performance.
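To make the Bloom filter point concrete, here is a minimal sketch of how such a filter lets a read skip store files that definitely do not contain a key. It is a toy model and not HBase's actual implementation: the `BloomFilter` class, `make_store_file`, `read`, and the sizes used are all hypothetical names and values chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def make_store_file(rows):
    """Pair a dict of rows (standing in for an on-disk file) with its filter."""
    bloom = BloomFilter()
    for key in rows:
        bloom.add(key)
    return bloom, rows


def read(key, store_files):
    """Return (value, files_opened), skipping files the filter rules out."""
    files_opened = 0
    for bloom, rows in store_files:
        if not bloom.might_contain(key):
            continue  # key is definitely not in this file, so skip it
        files_opened += 1
        if key in rows:
            return rows[key], files_opened
    return None, files_opened


store_files = [make_store_file({"row1": "a"}), make_store_file({"row2": "b"})]
value, opened = read("row2", store_files)
print(value, opened)
```

The filter can wrongly say "maybe" for a key that is absent, which costs one wasted file read, but it never says "no" for a key that is present, so skipping a file is always safe.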
HBase has a master-based architecture, while Cassandra has a masterless one. This means HBase has a single point of failure in its master, while Cassandra does not. However, the HBase client communicates directly with the region servers without going through the master, so the cluster can keep serving requests for some time after the master goes down.
Some of the advantages of HBase: random, consistent read/write access under high request volumes; automatic failover and reliability; and a flexible, column-based multidimensional map structure.
I don't think either is better than the other; it's not just one or the other. These are very different systems, each with its own strengths and weaknesses, so it really depends on your use cases. They can definitely be used to complement one another in the same infrastructure.
To explain the difference better, I'd like to borrow a picture from Cassandra: The Definitive Guide, where they go over the CAP theorem. What they say is basically that for any distributed system you have to find a balance between consistency, availability and partition tolerance, and you can only realistically satisfy 2 of these properties. In that picture, Cassandra sits on the availability and partition tolerance side, while HBase sits on the consistency and partition tolerance side.
When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, which provides a standard enterprise distribution for Hadoop.
But Cassandra also has more integration with Hadoop, namely DataStax Brisk, which is gaining popularity. You can also now natively stream data from the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example), so we are no longer at the point where Cassandra was just a standalone project.
In my experience, I've found that Cassandra is awesome for random reads, and not so much for scans.
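One way to see why random reads and scans behave so differently is to compare how keys are placed on nodes. The sketch below is a simplified model, not either system's real code: it assumes Cassandra is using a hash-based partitioner (reduced here to a hash modulo the node count rather than a token ring), and it models HBase as contiguous sorted key ranges with made-up split points. `NODES`, `REGION_STARTS`, and the function names are hypothetical.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c"]


def hash_partition(key):
    """Hash-based placement: adjacent keys land on unrelated nodes."""
    digest = hashlib.md5(key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Range-based placement: NODES[i] serves keys starting at REGION_STARTS[i].
# These split points are invented for the example.
REGION_STARTS = ["", "user3000", "user6000"]


def range_partition(key):
    """Sorted-range placement: adjacent keys usually share a node."""
    return NODES[bisect.bisect_right(REGION_STARTS, key) - 1]


def nodes_touched(keys, place):
    """Set of nodes that must be contacted to read all the given keys."""
    return {place(k) for k in keys}


# A scan over 100 adjacent row keys.
scan_keys = [f"user{n:04d}" for n in range(1000, 1100)]
hashed = nodes_touched(scan_keys, hash_partition)
ranged = nodes_touched(scan_keys, range_partition)

print("hash-partitioned scan touches:", sorted(hashed))
print("range-partitioned scan touches:", sorted(ranged))
print("single lookup touches:", nodes_touched(["user1000"], hash_partition))
```

A single-key lookup touches one node under either scheme, which is why random reads are cheap in both. The scan is where they diverge: with sorted ranges the 100 adjacent keys all fall in one region, while with hashing they are scattered across the cluster.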
To add a little color to the picture, I've been using both at my job in the same infrastructure, and HBase has a very different purpose than Cassandra. I've used Cassandra mostly for very fast real-time lookups, while I've used HBase more for heavy ETL batch jobs with less strict latency requirements.
This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the 2 systems. Bottom line is, there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.