Can Apache Spark run without Hadoop?

Question

Are there any dependencies between Spark and Hadoop?

If not, are there any features I'll miss when I run Spark without Hadoop?

Ravindra babu · Accepted Answer

Spark is an in-memory distributed computing engine.

Hadoop is a framework for distributed storage (HDFS), cluster resource management (YARN), and distributed processing (MapReduce).

Spark can run with or without the Hadoop components (HDFS/YARN).

Distributed Storage:

Spark does not have its own distributed storage system, so for distributed computing it has to rely on an external one, such as:

S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn’t critical.

Cassandra – Well suited to streaming data analysis, but overkill for batch jobs.

HDFS – Great fit for batch jobs without compromising on data locality.
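As a rough illustration of how the storage choice shows up in practice: file-based stores are usually selected by the URI scheme of the path handed to a Spark reader. The sketch below only builds such path strings. The function name and its parameters are hypothetical, not part of Spark, and the host, port and bucket values are placeholders.

```python
def dataset_uri(storage, location, path):
    """Build the path string that would be passed to a Spark reader.

    Hypothetical helper, for illustration only.
    storage:  "local", "hdfs" or "s3"
    location: ignored for "local"; "namenode-host:port" for "hdfs";
              the bucket name for "s3"
    path:     path of the dataset inside that store
    """
    path = path.lstrip("/")
    if storage == "local":
        # Local file system, no Hadoop cluster involved.
        return f"file:///{path}"
    if storage == "hdfs":
        return f"hdfs://{location}/{path}"
    if storage == "s3":
        # s3a:// is the scheme of Hadoop's S3 connector.
        return f"s3a://{location}/{path}"
    raise ValueError(f"unsupported storage: {storage!r}")


if __name__ == "__main__":
    print(dataset_uri("local", None, "/data/events"))
    print(dataset_uri("hdfs", "namenode:8020", "/data/events"))
    print(dataset_uri("s3", "my-bucket", "/data/events"))
```

Cassandra is left out on purpose: it is not addressed by a file path but read through the separate Spark Cassandra Connector.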

Distributed processing:

You can run Spark in three different cluster modes: Standalone, YARN and Mesos.

See the SE question below for a detailed explanation of both distributed storage and distributed processing:

Which cluster type should I choose for Spark?
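The mode is chosen through the master URL given to spark-submit with --master. A minimal sketch of the URL formats, assuming the default ports (7077 for a standalone master, 5050 for Mesos); the helper name, host names and app.py are hypothetical:

```python
def master_url(manager, host=None, port=None):
    """Return a value for spark-submit's --master option.

    Hypothetical helper, for illustration only.
    """
    if manager == "local":
        # Single machine, one worker thread per core; no cluster manager.
        return "local[*]"
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"
    if manager == "yarn":
        # The cluster location is read from HADOOP_CONF_DIR / YARN_CONF_DIR.
        return "yarn"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    raise ValueError(f"unknown cluster manager: {manager!r}")


def submit_command(manager, app, host=None, port=None):
    """Assemble the spark-submit argument list for the given mode."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


if __name__ == "__main__":
    print(" ".join(submit_command("standalone", "app.py", host="master-host")))
```

Only the "yarn" value depends on a Hadoop installation; the local, standalone and Mesos URLs do not.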

Arnon Rotem-Gal-Oz · Answer

Spark can run without Hadoop, but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what was needed to set it up properly here).

(Edit) Note: since version 2.3.0, Spark also has native support for Kubernetes.
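One place where the reliance on Hadoop's code becomes visible is S3 access: it goes through Hadoop's s3a connector (the hadoop-aws module has to be on the classpath), and its settings are passed to Spark as Hadoop properties with a spark.hadoop. prefix. The sketch below is not the setup described above, only an assumed minimal example of building such flags; the helper name and the credential values are placeholders.

```python
def s3a_conf_flags(access_key, secret_key, endpoint=None):
    """Build spark-submit --conf flags for Hadoop's s3a connector.

    Hypothetical helper, for illustration only. Properties prefixed with
    "spark.hadoop." are forwarded by Spark to the Hadoop configuration.
    """
    conf = {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
    if endpoint:
        # Only needed for a non-default S3 endpoint.
        conf["spark.hadoop.fs.s3a.endpoint"] = endpoint

    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(s3a_conf_flags("PLACEHOLDER_KEY", "PLACEHOLDER_SECRET"))
```

In a real deployment, credentials would normally come from the environment or an instance role rather than from command-line flags.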

Tags: amazon-s3, apache-spark, hadoop, mapreduce, mesos, tourist
