Large-scale data processing: HBase vs. Cassandra [closed]

Tags: nosql, cassandra, hadoop, hbase, data-processing

I have nearly settled on Cassandra after researching large-scale data storage solutions, but it's generally said that HBase is a better solution for large-scale data processing and analysis.

Both are key/value stores, and both can run under a Hadoop layer (Cassandra only recently), so what makes HBase the better candidate when processing and analysis of large data sets is required?

I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/, but I'm still looking for concrete advantages of HBase.

I am more convinced by Cassandra because of how simple it is to add nodes, its seamless replication, and its lack of a single point of failure. It also supports secondary indexes, which is a good plus.

337

asked Aug 29 '11 23:08

Gary Lindahl

1 Answer

As a Cassandra developer, I'm better at answering the other side of the question:

Cassandra scales better. Cassandra is known to scale to over 400 nodes in a cluster; when Facebook deployed Messaging on top of HBase they had to shard it across 100-node HBase sub-clusters.
Cassandra supports hundreds, even thousands of ColumnFamilies. "HBase currently does not do well with anything above two or three column families."
As a fully distributed system with no "special" nodes or processes, Cassandra is simpler to set up and operate, easier to troubleshoot, and more robust.
Cassandra's support for multi-master replication means that not only do you get the obvious power of multiple datacenters -- geographic redundancy, local latencies -- but you can also split realtime and analytical workloads into separate groups, with realtime, bidirectional replication between them. If you don't split those workloads apart they will contend spectacularly.
Because each Cassandra node manages its own local storage, Cassandra has a substantial performance advantage that is unlikely to be narrowed significantly. (E.g., it's standard practice to put the Cassandra commitlog on a separate device so it can do its sequential writes unimpeded by random i/o from read requests.)
Cassandra lets you choose how strong a consistency guarantee to require on a per-operation basis. Sometimes this is misunderstood as "Cassandra does not give you strong consistency," but that is incorrect.
Cassandra offers RandomPartitioner as well as the more Bigtable-like OrderedPartitioner. RandomPartitioner is much less prone to hot spots.
Cassandra offers on- or off-heap caching with performance comparable to memcached, but without the cache consistency problems or the complexity of requiring extra moving parts.
Non-Java clients are not second-class citizens.
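
The per-operation consistency point above can be illustrated with the usual replica-overlap rule: a read is guaranteed to see the latest acknowledged write when the replicas read plus the replicas written exceed the replication factor. The following is only a rough sketch of that arithmetic, not Cassandra client code. The function names are hypothetical, and a replication factor of 3 is assumed for the example.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


# Assumed replication factor for this illustration.
RF = 3

# Number of replicas that must respond at each consistency level.
LEVELS = {"ONE": 1, "QUORUM": quorum(RF), "ALL": RF}

if __name__ == "__main__":
    for write_name, w in LEVELS.items():
        for read_name, r in LEVELS.items():
            verdict = "strong" if is_strongly_consistent(RF, w, r) else "eventual"
            print(f"write={write_name:<6} read={read_name:<6} -> {verdict}")
```

Under these assumptions, writing and reading at QUORUM gives strong consistency, while writing and reading at ONE trades it for lower latency. That choice can be made differently for each operation.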

To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.
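
The trade-off between hot spots and range scans can be sketched with a toy placement model. This is a deliberate simplification under stated assumptions: four nodes, hypothetical split points for the ordered layout, and MD5 modulo the node count standing in for hashed placement (a real cluster assigns token ranges on a ring rather than taking a modulo). None of the names below come from either project's API.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

NODES = 4

# Hypothetical split points for an ordered layout: keys below "g" go to
# node 0, keys from "g" up to "n" go to node 1, and so on.
ORDERED_SPLITS = ["g", "n", "t"]


def ordered_node(key: str) -> int:
    """Place a key by its sort order, as an order-preserving layout would."""
    return bisect_left(ORDERED_SPLITS, key)


def hashed_node(key: str) -> int:
    """Place a key by the MD5 of its bytes (simplified hashed placement)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


# A contiguous key range, such as a batch job scanning one prefix.
keys = [f"user{i:04d}" for i in range(1000)]

ordered_counts = Counter(ordered_node(k) for k in keys)
hashed_counts = Counter(hashed_node(k) for k in keys)

if __name__ == "__main__":
    print("ordered:", dict(ordered_counts))
    print("hashed: ", dict(hashed_counts))
```

In this model the ordered layout puts the whole range on one node, which keeps a scan contiguous but concentrates load there, while the hashed layout spreads the same keys across all nodes and so turns a range scan into scattered lookups.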

There are also some things that are not necessarily better, or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning, and adds SuperColumns instead.

Hope that helps!

174

answered Sep 19 '22 08:09

jbellis
