Can Apache Spark run without Hadoop?

Are there any dependencies between Spark and Hadoop?

If not, are there any features I'll miss when I run Spark without Hadoop?

asked Aug 15 '15 by tourist

2 Answers

Spark is an in-memory distributed computing engine.

Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).

Spark can run with or without the Hadoop components (HDFS/YARN).
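For illustration, a minimal sketch (e.g. pasted into spark-shell) of a job that runs entirely on the local machine, with no HDFS or YARN anywhere; the app name and the computation are made up:

```scala
import org.apache.spark.sql.SparkSession

// Local mode: "local[*]" uses all cores of the current machine,
// so no Hadoop cluster is required at all.
val spark = SparkSession.builder()
  .appName("no-hadoop-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// A small in-memory computation with no distributed storage involved.
val evens = spark.range(1, 1000000).filter(_ % 2 == 0).count()
println(s"Even numbers: $evens")

spark.stop()
```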


Distributed Storage:

Since Spark does not have its own distributed storage system, it has to rely on one of the following storage backends for distributed computing (see the sketch after this list).

S3 – Best for non-urgent batch jobs; it fits use cases where data locality isn't critical.

Cassandra – Perfect for streaming data analysis, but overkill for batch jobs.

HDFS – Great fit for batch jobs without compromising on data locality.
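As a rough sketch of the point above: the storage backend is selected purely by the path scheme, while the reading code stays identical (host, bucket, and file names below are placeholders, and the s3a scheme additionally needs the hadoop-aws module plus credentials):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("storage-scheme-sketch").getOrCreate()

// The URI scheme decides which connector the Hadoop FileSystem API uses;
// Cassandra would instead go through the spark-cassandra-connector package.
val fromHdfs  = spark.read.text("hdfs://namenode:8020/data/events.log")
val fromS3    = spark.read.text("s3a://my-bucket/data/events.log")
val fromLocal = spark.read.text("file:///tmp/events.log")
```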


Distributed processing:

You can run Spark in three different modes: Standalone, YARN, and Mesos.
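As a hedged sketch, the cluster manager is chosen through the master URL while the application code stays the same (host names and ports are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// The master URL picks the cluster manager; the job code does not change.
val spark = SparkSession.builder()
  .appName("cluster-mode-sketch")
  .master("spark://master-host:7077")   // Standalone cluster manager
  // .master("yarn")                    // Hadoop YARN (reads HADOOP_CONF_DIR)
  // .master("mesos://mesos-host:5050") // Apache Mesos
  .getOrCreate()
```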

Have a look at the SE question below for a detailed explanation of both distributed storage and distributed processing.

Which cluster type should I choose for Spark?

answered by Ravindra babu

Spark can run without Hadoop, but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what is needed to set it up properly here).
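For context, a sketch of the kind of S3 wiring involved. The property keys are the standard s3a settings from Hadoop's hadoop-aws module; the values and the bucket are placeholders, and in practice credentials usually come from the environment or an IAM role rather than code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-config-sketch").getOrCreate()

// Standard s3a settings from the hadoop-aws module; values are placeholders.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
hadoopConf.set("fs.s3a.endpoint", "s3.amazonaws.com")

// Parquet handling is one of the places where Spark leans on Hadoop code.
val df = spark.read.parquet("s3a://my-bucket/path/to/data") // hypothetical bucket
```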

(Edit) Note: since version 2.3.0, Spark also includes native support for running on Kubernetes.
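A hedged sketch of what that looks like: the "k8s://" master URL scheme is real, but the API server address and container image below are placeholders. Cluster deployments are normally driven through spark-submit; a client-mode session like the one sketched here was added in later versions:

```scala
import org.apache.spark.sql.SparkSession

// Kubernetes as the cluster manager; address and image are placeholders.
val spark = SparkSession.builder()
  .appName("spark-on-k8s-sketch")
  .master("k8s://https://kubernetes.example.com:6443")
  .config("spark.kubernetes.container.image", "my-registry/spark:2.3.0")
  .getOrCreate()
```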

answered by Arnon Rotem-Gal-Oz