The Spark FAQ specifically says you don't have to use HDFS:
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
So, what are the advantages/disadvantages of using Apache Spark with HDFS vs. other distributed file systems (such as NFS) if I'm not planning to use Hadoop MapReduce? Will I be missing an important feature if I use NFS instead of HDFS for the nodes storage (for checkpoint, shuffle spill, etc)?
Well suited to machine learning algorithms – Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly. Spark can also speed up jobs running on data stored in the Hadoop platform, reportedly up to 100 times faster than MapReduce for some workloads.
Whereas Hadoop reads and writes files to HDFS, Spark processes data in RAM using RDDs (Resilient Distributed Datasets). Spark can run in standalone mode, with a Hadoop cluster serving as the data source, or in conjunction with Mesos.
Spark works with many other storage systems as well including AWS S3, HBase, and more. Many companies deploy Spark with Hadoop because one enhances the other.
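To illustrate, the storage backend is normally selected just by the URI scheme of the path you hand to Spark (the hostname, port, bucket, and paths below are made-up examples):

```
hdfs://namenode:8020/data/input.txt    # HDFS
file:///mnt/nfs/data/input.txt         # NFS mount at the same path on all nodes
s3a://my-bucket/data/input.txt         # AWS S3 (via the s3a connector)
```

The same application code can therefore move between storage systems by changing only the input/output paths.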
Apache Spark is potentially 100 times faster than Hadoop MapReduce. Apache Spark utilizes RAM and isn't tied to Hadoop's two-stage paradigm. Apache Spark works well for smaller data sets that can all fit into a server's RAM. Hadoop is more cost-effective for processing massive data sets.
After a few months and some experience with both NFS and HDFS, I can now answer my own question:
NFS lets you view and change files on a remote machine as if they were stored on a local machine. HDFS can do that too, but it is distributed (as opposed to NFS) and also fault-tolerant and scalable.
The advantage of NFS is its simplicity of setup, so I would probably use it for QA environments or small clusters. The advantage of HDFS is of course its fault tolerance, but an even bigger advantage, IMHO, is the ability to exploit data locality: when HDFS is co-located with the Spark nodes, it provides the best performance for checkpoints, shuffle spill, etc.
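As a rough sketch of how this looks in practice (hostnames and paths below are hypothetical): the checkpoint directory is set in application code via `sc.setCheckpointDir(...)`, while shuffle spill location is a cluster config property. Note that shuffle spill goes to node-local disk in either setup; only checkpoints actually land on HDFS or NFS.

```
# spark-defaults.conf -- illustrative values only

# Shuffle spill and map output are written to node-local scratch disk,
# regardless of whether HDFS or NFS backs the shared storage:
spark.local.dir    /tmp/spark-scratch

# The checkpoint directory is set in code, not here, e.g.:
#   sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")   # HDFS
#   sc.setCheckpointDir("file:///mnt/nfs/spark/checkpoints")        # NFS mount
```

With the HDFS variant, checkpoint writes can hit a co-located DataNode on the same machine; with the NFS variant, every write crosses the network to the NFS server.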