Why HBase is a better choice than Cassandra with Hadoop?

1 Answer

I don't think either is better than the other, and it doesn't have to be one or the other. These are very different systems, each with its own strengths and weaknesses, so it really depends on your use cases. They can definitely be used to complement one another in the same infrastructure.

To explain the difference better, I'd like to borrow a picture from Cassandra: The Definitive Guide, where they go over the CAP theorem. What they say is basically that for any distributed system, you have to find a balance between consistency, availability and partition tolerance, and you can only realistically satisfy two of these properties. From that you can see that:

Cassandra satisfies the Availability and Partition Tolerance properties.
HBase satisfies the Consistency and Partition Tolerance properties.

[Figure: CAP theorem diagram from Cassandra: The Definitive Guide]
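As a rough illustration of that trade-off, here is a toy Python sketch of what "AP" versus "CP" means when the network splits. Everything in it (the `ToyStore` class, the two replicas, the `mode` flag) is made up for the example and is not how Cassandra or HBase are actually implemented; it only shows that during a partition you either keep accepting writes and let copies diverge, or refuse writes and keep the copies identical.

```python
class ToyStore:
    """Two replicas of a single value. A toy model, not real Cassandra or HBase."""

    def __init__(self, mode):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.replicas = {"a": None, "b": None}
        self.partitioned = False  # True means the replicas cannot reach each other

    def write(self, replica, value):
        """Try to write through one replica; return True if the write is accepted."""
        if not self.partitioned:
            # Healthy network: every replica gets the new value.
            for name in self.replicas:
                self.replicas[name] = value
            return True
        if self.mode == "AP":
            # Stay available: accept the write locally and let the copies diverge.
            self.replicas[replica] = value
            return True
        # CP: refuse the write rather than let the copies disagree.
        return False

    def consistent(self):
        """True when both replicas hold the same value."""
        return self.replicas["a"] == self.replicas["b"]


def demo(mode):
    store = ToyStore(mode)
    store.write("a", "v1")       # replicated to both copies
    store.partitioned = True     # the network splits
    accepted = store.write("a", "v2")
    return accepted, store.consistent()


print(demo("AP"))  # (True, False)
print(demo("CP"))  # (False, True)
```

The AP store takes the write but ends up with two different values until the replicas can reconcile, while the CP store stays consistent at the cost of rejecting the write.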

When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, whose distribution is a standard enterprise distribution for Hadoop.

But Cassandra also has growing integration with Hadoop, notably DataStax Brisk, which is gaining popularity. You can also now natively stream data from the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example). We are no longer at the point where Cassandra was just a standalone project.

In my experience, Cassandra is awesome for random reads, and not so much for scans.

To add a little color to the picture, I've been using both at my job in the same infrastructure, and HBase serves a very different purpose than Cassandra. I've used Cassandra mostly for very fast real-time lookups, while I've used HBase more for heavy ETL batch jobs with less strict latency requirements.
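One way to see why random reads and scans behave so differently is to compare a hash-ordered layout with a key-ordered one. The sketch below is a simplified model, not either system's storage engine: it assumes rows are placed by a hash of the key (as with Cassandra's default hash-based partitioners) or kept sorted by row key (as HBase does), and the helper names (`hash_token`, `scan_ordered`, `scan_hashed`) and the MD5 hash are illustrative choices only.

```python
import bisect
import hashlib


def hash_token(key):
    """Stand-in for a hash partitioner: rows are placed by this token."""
    return hashlib.md5(key.encode()).hexdigest()


keys = [f"user{i:03d}" for i in range(100)]

# Hash-ordered layout: neighbouring keys end up far apart.
hashed = sorted(keys, key=hash_token)
by_token = {hash_token(k): k for k in keys}

# Key-ordered layout: neighbouring keys are stored next to each other.
ordered = sorted(keys)


def point_lookup(key):
    """A random read is a single token lookup in the hashed layout."""
    return by_token[hash_token(key)]


def scan_ordered(start, stop):
    """Range scan on sorted keys: seek once, then read a contiguous slice."""
    lo = bisect.bisect_left(ordered, start)
    hi = bisect.bisect_left(ordered, stop)
    return ordered[lo:hi], hi - lo  # (rows, rows examined)


def scan_hashed(start, stop):
    """Range scan on hashed keys: every row has to be checked."""
    rows = sorted(k for k in hashed if start <= k < stop)
    return rows, len(hashed)  # (rows, rows examined)


print(point_lookup("user042"))                 # user042
print(scan_ordered("user010", "user020")[1])   # 10
print(scan_hashed("user010", "user020")[1])    # 100
```

Both scans return the same ten rows, but the sorted layout only touches those ten, while the hashed layout has to look at all hundred.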

This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the two systems. Bottom line is, there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.


answered Sep 21 '22 13:09

Charles Menguy

Tags:

nosql

cassandra

hadoop

hbase

cap-theorem

Niladri Biswas
