Is HDFS necessary for Spark workloads?

Question

HDFS is not strictly required to run Spark, but some sources recommend it.

To help evaluate whether getting HDFS running is worth the effort:

What are the benefits of using HDFS for Spark workloads?

Ravindra babu · Accepted Answer

Spark is a distributed processing engine and HDFS is a distributed storage system.

If HDFS is not an option, Spark has to read and write its data through some alternative storage system, such as Apache Cassandra or Amazon S3.

Have a look at this comparison:

S3 – Non-urgent batch jobs. S3 fits very specific use cases where data locality isn’t critical.

Cassandra – Perfect for streaming data analysis, but overkill for batch jobs.

HDFS – Great fit for batch jobs without compromising on data locality.
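
To illustrate how interchangeable these backends are from Spark's point of view, here is a minimal PySpark-style sketch. For HDFS and S3, Spark picks the filesystem from the URI scheme, so switching storage is mostly a matter of changing the path; Cassandra goes through a separate connector. The host name, port, bucket, paths, and helper function names below are hypothetical placeholders, the `s3a://` scheme assumes the Hadoop S3A connector is on the classpath, and the Cassandra reader assumes the Spark Cassandra Connector package is installed.

```python
from urllib.parse import urlparse

# Hypothetical dataset locations: host, port, bucket, and paths are placeholders.
DATASETS = {
    "hdfs": "hdfs://namenode:8020/data/events",
    "s3": "s3a://example-bucket/data/events",
}


def storage_backend(uri):
    """Return the storage system implied by the URI scheme."""
    scheme = urlparse(uri).scheme
    return {"hdfs": "HDFS", "s3a": "Amazon S3"}.get(scheme, "unknown")


def read_events(spark, uri):
    """Read Parquet data; Spark selects the filesystem from the URI scheme.

    `spark` is expected to be an existing SparkSession.
    """
    return spark.read.parquet(uri)


def read_cassandra_table(spark, keyspace, table):
    """Read a Cassandra table (assumes the Spark Cassandra Connector is available)."""
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )


if __name__ == "__main__":
    for name, uri in DATASETS.items():
        print(name, "->", storage_backend(uri))
```

With a live `SparkSession`, `read_events(spark, DATASETS["hdfs"])` and `read_events(spark, DATASETS["s3"])` would return DataFrames with the same API. What differs is where the bytes live, and therefore whether tasks can be scheduled next to the data.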

When should you use HDFS as the storage engine for Spark distributed processing?

If you already have a large Hadoop cluster in place and are looking for real-time analytics on your data, Spark can use the existing Hadoop cluster, which reduces development time.
Spark is an in-memory computing engine. Since data does not always fit into memory, it has to be spilled to disk for some operations, and Spark benefits from HDFS in this case. The Daytona GraySort sorting record set by Spark used HDFS as storage for the sort operation.
HDFS is a scalable, reliable, and fault-tolerant distributed file system (since the Hadoop 2.x release). Thanks to the data locality principle, processing speed is improved.
It is best suited to batch-processing jobs.
