How do you read from and write into different Elasticsearch clusters using Spark and elasticsearch-hadoop?

Original title: Besides HDFS, what other DFS does Spark support (and are recommended)?

I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) on several gigantic clusters.

From time to time, I would like to pull an entire cluster's worth of data out, process each doc, and put all of it into a different Elasticsearch (ES) cluster (yes, data migration too).

Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with Spark + elasticsearch-hadoop, because that would mean swapping the SparkContext out from under an RDD. So I would like to write the RDDs out as object files and later read them back into RDDs with different SparkContexts.
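For illustration, a minimal sketch of the round trip I have in mind; the shared DFS path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Context pointed at the source ES cluster.
val sourceConf = new SparkConf().setAppName("dump").set("es.nodes", "from.escluster.com")
val sc1 = new SparkContext(sourceConf)
sc1.esRDD("some/lovelydata").saveAsObjectFile("dfs://shared/lovelydata")
sc1.stop()

// Later: a fresh context pointed at the target ES cluster.
val targetConf = new SparkConf().setAppName("load").set("es.nodes", "to.escluster.com")
val sc2 = new SparkContext(targetConf)
val reloaded = sc2.objectFile[(String, scala.collection.Map[String, AnyRef])]("dfs://shared/lovelydata")
reloaded.saveToEsWithMeta("clone/lovelydata")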

However, here comes the problem: I then need a DFS (distributed file system) to share those big files across my entire Spark cluster. The most popular choice is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that Spark supports?

Update

Thanks to @Daniel Darabos's answer below, I can now read data from one Elasticsearch cluster and write it into a different one using the following Scala code:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // brings in esRDD and saveToEsWithMeta

// The context is configured against the source cluster.
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")

val sc = new SparkContext(conf)

// Read the whole index/type as (document ID, document) pairs.
val allDataRDD = sc.esRDD("some/lovelydata")

// The per-write cfg map overrides es.nodes, pointing this write at the target cluster.
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
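Note the use of saveToEsWithMeta rather than saveToEs: esRDD yields (document ID, document) tuples, and saveToEsWithMeta treats the key as metadata, so the documents keep their original IDs in the target cluster.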
asked Mar 12 '15 by Winston Chen


People also ask

How do Spark and Hadoop work together?

Spark is a fast, general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
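For example, reading HDFS-resident data from Spark is a one-liner (host and path are illustrative):

// Illustrative host and path: count the lines of a file stored in HDFS.
val lines = sc.textFile("hdfs://namenode:8020/data/events.log")
println(lines.count())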

How does Elasticsearch work with Hadoop?

With dynamic extensions to existing Hadoop APIs, ES-Hadoop lets you easily move data bi-directionally between Elasticsearch and Hadoop while exposing HDFS as a repository for long-term archival. Partition awareness, failure handling, type conversions, and co-location are all done transparently.
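As a sketch of the archival direction (the index name and HDFS path are placeholders), the Spark side of ES-Hadoop can dump an index to HDFS as raw JSON:

import org.elasticsearch.spark._

// esJsonRDD yields (document ID, raw JSON string) pairs straight from the index.
val docs = sc.esJsonRDD("logs/2015")
docs.values.saveAsTextFile("hdfs://namenode:8020/archive/logs-2015")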

What is a Spark cluster?

A Spark cluster is a combination of a driver program, a cluster manager, and worker nodes that work together to complete tasks. The SparkContext coordinates processes across the cluster and sends tasks to the executors on the worker nodes to run.
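A minimal sketch of how the pieces connect (the master URL is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// The driver program builds a SparkContext; "master" names the cluster manager.
val conf = new SparkConf()
  .setAppName("example")
  .setMaster("spark://master-host:7077") // standalone cluster manager; "yarn" also works
val sc = new SparkContext(conf)
// Jobs submitted through sc are split into tasks and shipped to executors on the workers.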


1 Answer

Spark uses the hadoop-common library for file access, so whatever file systems Hadoop supports will work with Spark. I've used it with HDFS, S3 and GCS.
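The file system is selected by the URI scheme, so the same calls work against any of them. A sketch with illustrative buckets and paths (credentials must be configured separately):

// The scheme (hdfs://, s3n://, gs://) picks the Hadoop FileSystem implementation.
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/part-*")
val fromS3   = sc.textFile("s3n://my-bucket/data/part-*")      // or s3a:// on newer Hadoop
fromHdfs.union(fromS3).saveAsTextFile("gs://my-bucket/merged") // GCS via the GCS connector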

I'm not sure I understand why you don't just use elasticsearch-hadoop for this. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments, so you can use two ES clusters with the same SparkContext without any problems.
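A sketch of that approach, using the closely related newAPIHadoopRDD and saveAsHadoopDataset with elasticsearch-hadoop's Map/Reduce layer (EsInputFormat/EsOutputFormat); the hostnames are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.hadoop.mapred.{FileOutputCommitter, FileOutputFormat, JobConf}
import org.elasticsearch.hadoop.mr.{EsInputFormat, EsOutputFormat}

// Read configuration points at the source cluster.
val readCfg = new Configuration()
readCfg.set("es.nodes", "from.escluster.com")
readCfg.set("es.resource", "some/lovelydata")

val docs = sc.newAPIHadoopRDD(readCfg,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])

// Write configuration points at the target cluster; same SparkContext throughout.
val writeCfg = new JobConf(sc.hadoopConfiguration)
writeCfg.set("es.nodes", "to.escluster.com")
writeCfg.set("es.resource", "clone/lovelydata")
writeCfg.setOutputFormat(classOf[EsOutputFormat])
writeCfg.setOutputCommitter(classOf[FileOutputCommitter])
FileOutputFormat.setOutputPath(writeCfg, new Path("-")) // EsOutputFormat ignores the path

docs.saveAsHadoopDataset(writeCfg)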

answered Oct 22 '22 by Daniel Darabos