Original title: Besides HDFS, what other DFS does Spark support (and which are recommended)?
I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) with several gigantic clusters.
From time to time, I would like to pull all the data out of a cluster, process each doc, and put all of them into a different Elasticsearch (ES) cluster (yes, data migration too).
As far as I can tell, there is currently no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with Spark + elasticsearch-hadoop, because that would involve swapping the SparkContext out from under the RDD. So I would like to write the RDD into object files and later read them back into RDDs with a different SparkContext.
However, here comes the problem: I then need a DFS (Distributed File System) to share the big files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that Spark supports?
Update Below
Thanks to @Daniel Darabos's answer below, I can now read and write data from/into different ElasticSearch clusters using the following Scala code:
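import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD and saveToEsWithMeta

// Read from the source cluster, which is configured on the SparkConf.
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)
val allDataRDD = sc.esRDD("some/lovelydata")
// Write to the target cluster by overriding es.nodes for this call only.
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)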
How does Spark relate to Apache Hadoop? Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
With dynamic extensions to existing Hadoop APIs, ES-Hadoop lets you easily move data bi-directionally between Elasticsearch and Hadoop while exposing HDFS as a repository for long-term archival. Partition awareness, failure handling, type conversions, and co-location are all done transparently.
What is a Spark cluster? A Spark cluster is a combination of a Driver Program, Cluster Manager, and Worker Nodes that work together to complete tasks. The SparkContext lets us coordinate processes across the cluster. The SparkContext sends tasks to the Executors on the Worker Nodes to run.
Spark uses the hadoop-common library for file access, so whatever file systems Hadoop supports will work with Spark. I've used it with HDFS, S3 and GCS.
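For the object-file route described in the question, a minimal sketch might look like the following. The bucket name and path are hypothetical, and the `s3a://` scheme assumes the `hadoop-aws` module and valid credentials are available on the classpath; any other URI scheme that Hadoop's file system layer understands (for example `hdfs://` or `gs://` with the matching connector) should work the same way.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ObjectFileRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("object-file-round-trip"))

    // Hypothetical location; replace with a path every worker can reach.
    val path = "s3a://my-bucket/es-dump"

    // Stand-in data; in the migration case this would be the RDD read from ES.
    val docs = sc.parallelize(Seq("1" -> "first doc", "2" -> "second doc"))

    // Serialize the RDD to the shared file system.
    docs.saveAsObjectFile(path)

    // Later, possibly from a different SparkContext, load it back.
    val restored = sc.objectFile[(String, String)](path)
    println(restored.count())

    sc.stop()
  }
}
```

The element type passed to `objectFile` has to match what was written, since the files carry Java-serialized objects rather than a self-describing schema.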
I'm not sure I understand why you don't just use elasticsearch-hadoop. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments, so you can use two ES clusters with the same SparkContext without any problems.
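As a rough sketch of that idea, the snippet below builds one Hadoop Configuration per cluster and passes each to the matching read or write call. It is an assumption-laden illustration rather than the answerer's exact code: it uses `sc.newAPIHadoopRDD` and `saveAsNewAPIHadoopFile` (close relatives of the methods named above) together with elasticsearch-hadoop's `EsInputFormat` and `EsOutputFormat`, reuses the host and index names from the question's update, and passes `"-"` as a placeholder path that the ES output format is assumed to ignore.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, NullWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.hadoop.mr.{EsInputFormat, EsOutputFormat}

object TwoClusterCopy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("two-cluster-copy"))

    // Configuration for the source cluster.
    val readConf = new Configuration()
    readConf.set("es.nodes", "from.escluster.com")
    readConf.set("es.resource", "some/lovelydata")

    // Separate configuration for the target cluster.
    val writeConf = new Configuration()
    writeConf.set("es.nodes", "to.escluster.com")
    writeConf.set("es.resource", "clone/lovelydata")

    // Read (id, document) pairs from the source cluster.
    val docs = sc.newAPIHadoopRDD(
      readConf,
      classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text],
      classOf[MapWritable])

    // Write the documents to the target cluster with the same SparkContext.
    // The "-" path is a placeholder (assumed to be unused by EsOutputFormat).
    docs.saveAsNewAPIHadoopFile(
      "-",
      classOf[NullWritable],
      classOf[MapWritable],
      classOf[EsOutputFormat],
      writeConf)

    sc.stop()
  }
}
```

Compared with the `saveToEsWithMeta` call in the question's update, this variant is more verbose, and as far as I know the output format does not use the pair's key as the document ID, so the native `org.elasticsearch.spark` API is probably the more convenient choice when IDs need to be preserved.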