My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship. Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have sql style interface. However, Spark has its own sql. Why would one use Cassandra/Hive instead of Spark's native sql? Assuming that this is a brand new project with no existing installation?

Im writing a paper about Hadoop for university. And stumbled over your question. Spark is just using Hadoop for persistence and only if you want to use it. It's possible to use it with other persistence tiers like Amazon EC2. On the other hand-side spark is running in-memory and it's not primarly build to be used for map reduce use-cases like Hadoop was/is. I can recommend this article, if you like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/

What is the relationship between Spark, Hadoop and Cassandra

Tags:

cassandra

apache-spark

apache-spark-sql

hadoop

My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.

Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have sql style interface. However, Spark has its own sql. Why would one use Cassandra/Hive instead of Spark's native sql? Assuming that this is a brand new project with no existing installation?

891

asked Jun 27 '15 15:06

Shahbaz

2 Answers

Spark is a distributed in memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses the HDFS (Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS, and to save results in HDFS.

For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, Hbase, a Cassandra database, etc. Once loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.

In terms of it being an alternative to Hadoop, it can be much faster than Hadoop at certain operations. For example a multi-pass map reduce operation can be dramatically faster in Spark than with Hadoop map reduce since most of the disk I/O of Hadoop is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).

Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature rich query language and allows you to do data analytics that native CQL doesn't provide.

Another use case for Spark is for stream processing. Spark can be set up to ingest incoming real time data and process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.

So spark is really a standalone in memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack such a stream processing.

113

answered Sep 30 '22 06:09

Jim Meyer

Im writing a paper about Hadoop for university. And stumbled over your question. Spark is just using Hadoop for persistence and only if you want to use it. It's possible to use it with other persistence tiers like Amazon EC2.

On the other hand-side spark is running in-memory and it's not primarly build to be used for map reduce use-cases like Hadoop was/is.

I can recommend this article, if you like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/

answered Sep 30 '22 06:09

sascha10000

Related questions
                            
                                Save Spark dataframe as dynamic partitioned table in Hive
                            
                                Hadoop 2.2 Installation `.' no such file or directory
                            
                                Just enough Java for Hadoop [closed]
                            
                                Hadoop one Map and multiple Reduce
                            
                                putting a remote file into hadoop without copying it to local disk
                            
                                What is Google's Dremel? How is it different from Mapreduce?
                            
                                Hadoop DistributedCache is deprecated - what is the preferred API?
                            
                                Easiest way to install Python dependencies on Spark executor nodes?
                            
                                Spark Unable to load native-hadoop library for your platform
                            
                                Where HDFS stores files locally by default?
                            
                                Difference between `yarn.scheduler.maximum-allocation-mb` and `yarn.nodemanager.resource.memory-mb`?
                            
                                Spark Scala list folders in directory
                            
                                Loading Data from a .txt file to Table Stored as ORC in Hive
                            
                                When using --negotiate with curl, is a keytab file required?
                            
                                view contents of file in hdfs hadoop
                            
                                List the namenode and datanodes of a cluster from any node?
                            
                                HBase REST Filter ( SingleColumnValueFilter )
                            
                                Why isn't Hadoop implemented using MPI?
                            
                                How do you make a HIVE table out of JSON data?
                            
                                Download large data for Hadoop [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With