Share config files with spark-submit in cluster mode

I've been running my Spark jobs in "client" mode during development. I use "--files" to share config files with the executors, and the driver reads its config files locally. Now I want to deploy the job in "cluster" mode, and I'm having difficulty sharing the config files with the driver.

For example, I pass the config file name as extraJavaOptions to both the driver and the executors, and read the file using SparkFiles.get():

  val configFile = org.apache.spark.SparkFiles.get(System.getProperty("config.file.name"))

This works well on the executors but fails on the driver. I think the files are only shared with the executors and not with the container where the driver is running. One option is to keep the config files in S3, but I wanted to check whether this can be achieved using spark-submit.

spark-submit --deploy-mode cluster --master yarn --driver-cores 2 \
  --driver-memory 4g --num-executors 4 --executor-cores 4 --executor-memory 10g \
  --files /home/hadoop/Streaming.conf,/home/hadoop/log4j.properties \
  --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
  --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
  --class ....
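
For reference, the driver-side read looks roughly like the sketch below (only the SparkFiles.get call is from my actual job; the SparkSession setup and the use of scala.io.Source are just for illustration):

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.io.Source

object StreamingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingJob").getOrCreate()

    // Resolve the path of a file shipped with --files.
    // This works on the executors; in cluster mode the driver
    // does not find the file this way.
    val configFile = SparkFiles.get(System.getProperty("config.file.name"))
    val configText = Source.fromFile(configFile).mkString
    println(configText.take(200))

    spark.stop()
  }
}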
asked Oct 21 '16 by Cheeko


2 Answers

You need to try the --properties-file option of the spark-submit command.

For example, a properties file with the following content:

spark.key1=value1
spark.key2=value2

All keys need to be prefixed with spark.

Then use the spark-submit command like this to pass the properties file:

bin/spark-submit --properties-file  propertiesfile.properties

Then in your code you can read the keys using the SparkContext getConf method:

sc.getConf.get("spark.key1")  // returns value1

Once you have the key values, you can use them anywhere.
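
Applied to the question's setup, the driver-side read would look roughly like this (a sketch; the spark.streaming.* key names are hypothetical, put whatever settings Streaming.conf carries into the properties file as spark.-prefixed keys):

// Sketch: read custom keys that were loaded via --properties-file.
// The key names below are made-up examples, not from the question.
val brokers       = sc.getConf.get("spark.streaming.kafka.brokers")
val checkpointDir = sc.getConf.get("spark.streaming.checkpoint.dir")

Because every entry becomes an ordinary Spark configuration value, this works the same way in client and cluster mode: the driver reads it from its SparkConf, and you can close over the values wherever the executors need them, with no file distribution involved.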

answered Oct 13 '22 by Shankar


I found a solution for this problem in this thread.

You can give an alias to a file submitted through --files by appending '#alias' to its path. With this trick, you should be able to access the file through its alias.

For example, the following code can run without an error.

spark-submit --master yarn-cluster --files test.conf#testFile.conf test.py

with test.py as:

# The alias given with --files ('testFile.conf') resolves in the
# working directory of the YARN containers, including the driver's.
path_f = 'testFile.conf'
try:
    f = open(path_f, 'r')
except IOError:
    raise Exception('File not opened', 'EEEEEEE!')

and an empty test.conf
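
The same trick applied to the question's Scala job might look like the sketch below (the app.conf alias, the your-app.jar name, and the use of scala.io.Source are illustrative choices, not from the original answer):

// Submitted with (note the #app.conf alias on --files):
//   spark-submit --deploy-mode cluster --master yarn \
//     --files /home/hadoop/Streaming.conf#app.conf \
//     --class ... your-app.jar
import scala.io.Source

// The aliased file is placed in the working directory of the YARN
// containers, including the driver's in cluster mode, so a plain
// relative path works on both driver and executors.
val configText = Source.fromFile("app.conf").mkString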

answered Oct 13 '22 by Peter Pan