Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts and experiences on using CQL versus the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node, while a Shark/Spark query processor attached to a Cassandra cluster runs outside it, in a separate cluster. DataStax also offers DSE, an edition of Cassandra that lets you deploy Hadoop/Hive. The question is: in which use cases would we pick one of these solutions over the others?

470

asked Jun 14 '13 17:06

Minh Do

1 Answer

I will share a few thoughts based on my experience. If possible, though, please tell us about your use case. It will help us answer your questions better.

1- If you are going to have more writes than reads, Cassandra is an obvious choice. And if you are coming from a SQL background, you will find CQL very helpful. But if you need operations like JOIN and GROUP BY, CQL is not the answer: it covers only primitive GROUP BY use cases, through write-time and compaction-time sorting, and it can model one-to-many relationships, but it does not go beyond that.

2- Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and planned data pipelines. In-memory processing can make it ~100x faster than Hive. Like Hive, Spark SQL also handles larger-than-memory data sets very well, and there it can be up to 10x faster thanks to planned pipelines. The advantage shifts further toward Spark SQL when a query chains multiple pipeline steps, such as filter and groupBy. Go for it when you need ad-hoc, real-time querying. It is not suitable for long-running jobs over gigantic amounts of data.

3- Hive is basically a data warehouse that runs on top of your existing Hadoop cluster and gives you a SQL-like interface to your data. Hive is not suitable for real-time needs; it is best suited to offline batch processing. It needs no additional infrastructure, since it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN and GROUP BY on large data sets, and for OLAP.
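
To make point 1 a little more concrete, here is a minimal sketch in plain Python of what "sorting at write time" and a one-to-many layout look like. It is only a toy model, not Cassandra's API: `ToyTable`, `insert`, and `select` are hypothetical names, and the class merely imitates a table whose primary key is a partition key plus a clustering key.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy stand-in for a table with PRIMARY KEY (partition_key, clustering_key).

    Hypothetical illustration only; this is not how Cassandra is implemented.
    """

    def __init__(self):
        # partition key -> list of (clustering_key, value), always kept sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        keys = [k for k, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)  # same key again: overwrite (upsert)
        else:
            rows.insert(i, (clustering_key, value))  # keep order at write time

    def select(self, partition_key, limit=None):
        # Reads return rows already ordered; no sort or join happens here.
        rows = self._partitions.get(partition_key, [])
        return rows[:limit] if limit is not None else list(rows)


# One user (partition) owns many orders (clustered rows): a one-to-many layout.
orders = ToyTable()
orders.insert("alice", "2013-06-14", "book")
orders.insert("alice", "2013-06-01", "pen")
orders.insert("bob", "2013-06-10", "lamp")
alice_orders = orders.select("alice")
```

Reading one user's orders is a single ordered lookup, which is the kind of query this model serves well. Anything that has to combine or aggregate across partitions is where a separate engine such as Spark SQL or Hive comes in.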

Note: Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features, potentially with better performance. It supports the existing Hive query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
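
As a rough analogy for the "planned pipelines" idea in point 2, the plain-Python sketch below contrasts running filter and groupBy as two separate stages with fusing them into one pass. It is not Spark code and makes no claim about Spark's internals; `staged`, `pipelined`, and the sample `records` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def staged(records, min_amount):
    """Each step materializes its full output before the next one starts."""
    filtered = [r for r in records if r["amount"] >= min_amount]  # intermediate list
    totals = defaultdict(int)
    for r in filtered:
        totals[r["user"]] += r["amount"]
    return dict(totals)


def pipelined(records, min_amount):
    """Filter and group-by are fused: each record flows through both steps once."""
    filtered = (r for r in records if r["amount"] >= min_amount)  # lazy generator
    totals = defaultdict(int)
    for r in filtered:
        totals[r["user"]] += r["amount"]
    return dict(totals)


# Hypothetical sample data.
records = [
    {"user": "alice", "amount": 30},
    {"user": "bob", "amount": 5},
    {"user": "alice", "amount": 20},
    {"user": "bob", "amount": 50},
]

staged_totals = staged(records, 10)
pipelined_totals = pipelined(records, 10)
```

Both functions return the same totals; the pipelined version simply never builds the intermediate filtered list. The more steps a query chains, the more intermediate results a pipelined plan can avoid writing out.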

That said, I think you will be able to properly evaluate the pros and cons of these tools only after getting your hands dirty. I can only make suggestions based on your question.

Hope this answers some of your queries.

P.S.: The above answer is based solely on my experience. Comments/corrections are welcome.

115

answered Sep 24 '22 02:09

Tariq

Tags: cassandra, cql, apache-spark, hive, shark-sql
