The Spark FAQ specifically says you don't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for the nodes storage (for checkpoint, shuffle spill, etc)?
Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly. Spark can also speed up jobs running on data stored in the Hadoop platform, reportedly up to 100 times faster than MapReduce for some workloads.
Whereas Hadoop reads and writes files to HDFS, Spark processes data in RAM using RDDs (Resilient Distributed Datasets). Spark can run in standalone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos.
Spark works with many other storage systems as well including AWS S3, HBase, and more. Many companies deploy Spark with Hadoop because one enhances the other.
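To illustrate, the storage backend is normally selected just by the URI scheme of the path you hand to Spark (the hostname, port, bucket, and paths below are made-up examples):

```
hdfs://namenode:8020/data/input.txt    # HDFS
file:///mnt/nfs/data/input.txt         # NFS mount at the same path on all nodes
s3a://my-bucket/data/input.txt         # AWS S3 (via the s3a connector)
```

The same application code can therefore move between storage systems by changing only the input/output paths.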
Apache Spark is potentially 100 times faster than Hadoop MapReduce. Apache Spark utilizes RAM and isn't tied to Hadoop's two-stage paradigm. Apache Spark works well for smaller data sets that can all fit into a server's RAM. Hadoop is more cost-effective for processing massive data sets.
After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS lets you view and change files on a remote machine as if they were stored on a local machine. HDFS can do that too, but it is distributed (as opposed to NFS) and also fault-tolerant and scalable.
The advantage of NFS is its simplicity of setup, so I would probably use it for QA environments or small clusters. The advantage of HDFS is of course its fault tolerance, but an even bigger advantage, IMHO, is the ability to exploit data locality: when HDFS is co-located with the Spark nodes, it provides the best performance for checkpoints, shuffle spill, etc.
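As a rough sketch of how this looks in practice (hostnames and paths below are hypothetical): the checkpoint directory is set in application code via `sc.setCheckpointDir(...)`, while shuffle spill location is a cluster config property. Note that shuffle spill goes to node-local disk in either setup; only checkpoints actually land on HDFS or NFS.

```
# spark-defaults.conf -- illustrative values only

# Shuffle spill and map output are written to node-local scratch disk,
# regardless of whether HDFS or NFS backs the shared storage:
spark.local.dir    /tmp/spark-scratch

# The checkpoint directory is set in code, not here, e.g.:
#   sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")   # HDFS
#   sc.setCheckpointDir("file:///mnt/nfs/spark/checkpoints")        # NFS mount
```

With the HDFS variant, checkpoint writes can hit a co-located DataNode on the same machine; with the NFS variant, every write crosses the network to the NFS server.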