How do you read and write from/into different Elasticsearch clusters using Spark and elasticsearch-hadoop?

Original title: Besides HDFS, what other DFS does Spark support (and which are recommended)?

I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) on several very large clusters.

From time to time, I would like to pull all the data out of a cluster, process each document, and put the results into a different Elasticsearch (ES) cluster (yes, this covers data migration too).

As far as I can tell, there is currently no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with Spark + elasticsearch-hadoop, because the cluster is configured on the SparkContext, so switching clusters would mean swapping the SparkContext underneath an existing RDD. So I would like to write the RDDs to object files and later read them back into RDDs under a different SparkContext.

However, this creates a problem: I then need a distributed file system (DFS) to share the big files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that Spark supports?

Update

Thanks to @Daniel Darabos's answer below, I can now read and write data from/into different Elasticsearch clusters using the following Scala code:

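import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds esRDD and saveToEsWithMeta

// The SparkConf points at the source cluster.
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")

val sc = new SparkContext(conf)

// Read every document as an (id, document) pair.
val allDataRDD = sc.esRDD("some/lovelydata")

// Override es.nodes for this write only, so it goes to the target cluster.
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)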
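
The snippet above copies documents unchanged, while the question also mentions processing each document on the way. A minimal sketch of that step, assuming the elasticsearch-spark implicits are on the classpath; the `process` function and the `migrated` field are hypothetical placeholders for whatever transformation is actually needed:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object MigrateWithTransform {
  // Hypothetical per-document transformation: copies the document
  // and adds a marker field. Replace with the real processing logic.
  def process(doc: scala.collection.Map[String, AnyRef]): Map[String, AnyRef] =
    doc.toMap + ("migrated" -> java.lang.Boolean.TRUE)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Spark Migrating ES Data")
      .set("es.nodes", "from.escluster.com") // source cluster
    val sc = new SparkContext(conf)
    try {
      // esRDD yields (id, document) pairs; mapValues keeps the ids intact.
      val processed = sc.esRDD("some/lovelydata").mapValues(process)
      // The per-call settings send this write to the target cluster.
      processed.saveToEsWithMeta(
        "clone/lovelydata",
        Map("es.nodes" -> "to.escluster.com"))
    } finally {
      sc.stop()
    }
  }
}
```

Because only the values are mapped, each document keeps its original ID as the pair key, which is what `saveToEsWithMeta` uses as metadata when writing.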

asked Mar 12 '15 01:03

Winston Chen

1 Answer

Spark uses the hadoop-common library for file access, so any file system that Hadoop supports will work with Spark. I've used it with HDFS, S3 and GCS.
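
For the object-file plan described in the question, this means the path only needs a URI scheme that Hadoop can resolve. A rough sketch under that assumption: it runs in local mode against a temporary directory by default, and a path such as `s3a://some-bucket/es-dump` (a hypothetical bucket, which would also need the matching Hadoop connector and credentials) could be passed as the first argument instead.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ObjectFileRoundTrip {
  type Doc = (String, Map[String, String])

  // Writes the RDD as object files under `path`, then reads it back.
  // `path` must not exist yet; saveAsObjectFile refuses to overwrite.
  def roundTrip(sc: SparkContext, data: RDD[Doc], path: String): RDD[Doc] = {
    data.saveAsObjectFile(path)
    sc.objectFile[Doc](path)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("object-file-sketch")
      .setMaster("local[*]") // local mode, for illustration only
    val sc = new SparkContext(conf)
    try {
      // Default to a fresh local directory; any Hadoop-supported URI works.
      val path = args.headOption.getOrElse(
        Files.createTempDirectory("es-dump").resolve("docs").toUri.toString)
      val docs = sc.parallelize(Seq(
        "1" -> Map("title" -> "first"),
        "2" -> Map("title" -> "second")))
      val restored = roundTrip(sc, docs, path)
      println(s"restored ${restored.count()} documents from $path")
    } finally {
      sc.stop()
    }
  }
}
```

On a real cluster the local directory would not be visible to every executor, which is why a shared store such as S3 or GCS is needed there.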

I'm not sure I understand why you don't just use elasticsearch-hadoop directly. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments, so you can use two ES clusters with the same SparkContext without any problems.
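
To illustrate the per-call configuration idea, here is a hedged sketch that reads from both clusters in one SparkContext and compares document counts, for example as a sanity check after a migration. It is not taken from the answer: since Elasticsearch has no file path, it uses `sc.newAPIHadoopRDD` (a sibling of `newAPIHadoopFile` that takes the same kind of Configuration) with elasticsearch-hadoop's `EsInputFormat`. The `esConf` and `countDocs` helpers are names invented for this example.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.hadoop.mr.EsInputFormat

object TwoClustersOneContext {
  // Builds a Hadoop Configuration pointing elasticsearch-hadoop
  // at one cluster and one index/type resource.
  def esConf(nodes: String, resource: String): Configuration = {
    val conf = new Configuration()
    conf.set("es.nodes", nodes)
    conf.set("es.resource", resource)
    conf
  }

  // Counts the documents visible through the given configuration.
  def countDocs(sc: SparkContext, conf: Configuration): Long =
    sc.newAPIHadoopRDD(
      conf,
      classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text],
      classOf[MapWritable]).count()

  def main(args: Array[String]): Unit = {
    // No es.nodes on the SparkConf: each read carries its own settings.
    val sc = new SparkContext(new SparkConf().setAppName("Compare ES clusters"))
    try {
      val source = countDocs(sc, esConf("from.escluster.com", "some/lovelydata"))
      val target = countDocs(sc, esConf("to.escluster.com", "clone/lovelydata"))
      println(s"source=$source target=$target")
    } finally {
      sc.stop()
    }
  }
}
```

Each RDD carries its own Configuration, so neither cluster address has to live on the SparkContext itself.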

answered Oct 22 '22 13:10

Daniel Darabos

Tags:

elasticsearch

apache-spark

hdfs

distributed-filesystem

elasticsearch-hadoop
