I've been running my Spark jobs in "client" mode during development, using --files to share config files with the executors while the driver read the config files locally. Now I want to deploy the job in "cluster" mode, and I'm having difficulty sharing the config files with the driver.
For example, I'm passing the config file name as extraJavaOptions to both the driver and the executors, and reading the file with SparkFiles.get():
val configFile = org.apache.spark.SparkFiles.get(System.getProperty("config.file.name"))
This works well on the executors but fails on the driver. I think the files are only shared with the executors, not with the container where the driver is running. One option is to keep the config files in S3, but I wanted to check whether this can be achieved with spark-submit. Here is my submit command:
> spark-submit --deploy-mode cluster --master yarn --driver-cores 2 \
>   --driver-memory 4g --num-executors 4 --executor-cores 4 --executor-memory 10g \
>   --files /home/hadoop/Streaming.conf,/home/hadoop/log4j.properties \
>   --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
>   --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
>   --class ....
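For illustration, the client-mode setup described above looks roughly like this (the object name and the driver-side path are placeholders; only the SparkFiles.get() call matches my actual code):

import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

object StreamingJob {  // placeholder name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingJob").getOrCreate()

    // Driver side in client mode: reads the config from its local path on the
    // machine that ran spark-submit.
    val driverConfPath = "/home/hadoop/Streaming.conf"
    println(s"driver reads config from: $driverConfPath")

    // Executor side: -Dconfig.file.name=Streaming.conf is set through
    // spark.executor.extraJavaOptions, and --files ships the file to the executors.
    spark.sparkContext.parallelize(1 to 4).foreach { _ =>
      val execConfPath = SparkFiles.get(System.getProperty("config.file.name"))
      println(s"executor sees config at: $execConfPath")
    }

    spark.stop()
  }
}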
You can submit a Spark batch application in either cluster mode or client mode. The --deploy-mode option of spark-submit specifies where the driver program of your Spark application runs: in cluster mode the driver runs on one of the worker nodes in the cluster, and that node shows up as the driver on the Spark Web UI of your application; cluster mode is typically used for production jobs. In client mode the driver runs on the machine from which the application was submitted.
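For reference, the two modes differ only in the --deploy-mode flag on the submit command (the class and jar names below are placeholders):

spark-submit --master yarn --deploy-mode client --class com.example.StreamingJob streaming-job.jar
spark-submit --master yarn --deploy-mode cluster --class com.example.StreamingJob streaming-job.jar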
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos, YARN or Kubernetes), which allocate resources across applications.
The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it will read options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.
You need to try the --properties-file option of the spark-submit command. For example, a properties file with the following content:
spark.key1=value1
spark.key2=value2
All the keys need to be prefixed with spark. Then use spark-submit like this to pass the properties file:
bin/spark-submit --properties-file propertiesfile.properties
Then in your code you can get the keys using the SparkContext getConf method:
sc.getConf.get("spark.key1") // returns value1
Once you have the key values, you can use them anywhere in your application.
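Putting it together, a minimal driver-side sketch (the object and app names are placeholders; it assumes the propertiesfile.properties shown above):

import org.apache.spark.sql.SparkSession

object PropertiesFileExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PropertiesFileExample").getOrCreate()

    // Keys passed via --properties-file land in the application's SparkConf,
    // so the driver can read them in both client and cluster deploy modes.
    val value1 = spark.sparkContext.getConf.get("spark.key1")           // "value1"
    val value2 = spark.sparkContext.getConf.get("spark.key2", "none")   // with a default

    println(s"spark.key1=$value1, spark.key2=$value2")
    spark.stop()
  }
}

Since the values travel with the Spark configuration rather than as a separate file, this avoids the SparkFiles.get() problem on the driver for simple key/value settings.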
I found a solution for this problem in this thread.
You can give an alias to a file you submit through --files by appending '#alias' to its path. With this trick, you should be able to access the files through their alias.
For example, the following code runs without an error:
spark-submit --master yarn-cluster --files test.conf#testFile.conf test.py
with test.py as:
path_f = 'testFile.conf'
try:
    f = open(path_f, 'r')
except:
    raise Exception('File not opened', 'EEEEEEE!')
and an empty test.conf
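The Scala equivalent on the driver side would look roughly like this (a sketch, assuming the job was submitted with --files /home/hadoop/Streaming.conf#Streaming.conf; in yarn cluster mode the aliased file is localized into the driver container's working directory):

import scala.io.Source

object AliasedConfigCheck {  // placeholder name
  def main(args: Array[String]): Unit = {
    // The alias is a plain relative path in the driver's working directory.
    val source = Source.fromFile("Streaming.conf")
    try {
      println(source.getLines().mkString("\n"))
    } finally {
      source.close()
    }
  }
}

In client mode the driver simply keeps reading its local copy by its full path, as in the original setup.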