 

Using Apache Spark with HDFS vs. other distributed storage

The Spark FAQ specifically says one doesn't have to use HDFS:

Do I need Hadoop to run Spark?

No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for the nodes' storage (for checkpoints, shuffle spill, etc.)?

ofirski asked Sep 12 '15 19:09

People also ask

What are the advantages of using Apache Spark with Hadoop?

Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly. Runs up to 100 times faster – Spark can also speed up jobs that run on the Hadoop data-processing platform.

Why is Spark better than HDFS?

Whereas Hadoop reads and writes files to HDFS, Spark processes data in RAM using a concept known as an RDD (Resilient Distributed Dataset). Spark can run either in stand-alone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos.
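A minimal PySpark sketch of that in-memory model (this assumes a local Spark installation; `local[*]` runs Spark on one machine with no Hadoop cluster, and the numbers are illustrative only):

```python
from pyspark.sql import SparkSession

# Start a local Spark session (no Hadoop cluster required).
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()

# Build an RDD, cache it in memory, and query it repeatedly --
# the second action reuses the cached partitions instead of recomputing them.
rdd = spark.sparkContext.parallelize(range(1_000_000)).cache()

total = rdd.sum()                                  # first action materializes and caches the RDD
evens = rdd.filter(lambda x: x % 2 == 0).count()   # served from the in-memory cache

spark.stop()
```

The point of `cache()` here is exactly the snippet's claim: data is loaded into cluster memory once and queried repeatedly, rather than re-read from disk for each job.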

Is Spark compatible with other file storage system?

Spark works with many other storage systems as well including AWS S3, HBase, and more. Many companies deploy Spark with Hadoop because one enhances the other.

Why does industry prefer Apache Spark over Hadoop for big data processing?

Apache Spark is potentially 100 times faster than Hadoop MapReduce. Apache Spark utilizes RAM and isn't tied to Hadoop's two-stage paradigm. Apache Spark works well for smaller data sets that can all fit into a server's RAM. Hadoop is more cost-effective for processing massive data sets.


1 Answer

After a few months and some experience with both NFS and HDFS, I can now answer my own question:

NFS lets you view and change files on a remote machine as if they were stored on a local machine. HDFS can also do that, but it is distributed (as opposed to NFS), as well as fault-tolerant and scalable.

The advantage of using NFS is its simplicity of setup, so I would probably use it for QA environments or small clusters. The advantage of HDFS is of course its fault tolerance, but a bigger advantage, IMHO, is the ability to exploit data locality when HDFS is co-located with the Spark nodes, which gives the best performance for checkpoints, shuffle spill, etc.
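In code, the choice only shows up in the URI you configure for the checkpoint directory; a configuration sketch (hostnames and paths here are hypothetical, and this requires a Spark installation to run):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# HDFS: blocks are stored on the worker nodes themselves, so when HDFS
# is co-located with the executors, checkpoint I/O can stay node-local.
spark.sparkContext.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")

# NFS alternative: same API, but every checkpoint write crosses the
# network to the single NFS server -- simpler to set up, no locality.
# spark.sparkContext.setCheckpointDir("file:///mnt/nfs/spark/checkpoints")
```

The API is identical either way; only the performance characteristics behind the URI differ, which is the trade-off described above.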

ofirski answered Oct 23 '22 17:10