
Using typesafe config with Spark on Yarn

I have a Spark job that reads data from a configuration file. This file is a Typesafe Config file.

The code that reads the config looks like this:

ConfigFactory.load().getConfig("com.mycompany")
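
For a bit more context, that call sits in my main roughly like this (a sketch; the inputPath key is just an illustrative placeholder, the real file holds whatever settings my program needs):

    import com.typesafe.config.ConfigFactory

    object Main {
      def main(args: Array[String]): Unit = {
        // Load application.conf (from the classpath, or from wherever
        // -Dconfig.file points) and scope it to my namespace
        val config = ConfigFactory.load().getConfig("com.mycompany")
        // e.g. val inputPath = config.getString("inputPath")
      }
    }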

Now, I don't assemble application.conf into my uber jar, since I want to pass it as an external file.

The content of the external application.conf I want to use looks like this:

com.mycompany {
  //configurations my program needs
}

This application.conf file exists on my local machine's file system (not on HDFS).

I'm using Spark 1.6.1 with YARN.

This is what my spark-submit command looks like:

    LOG4J_FULL_PATH=/log4j-path
    ROOT_DIR=/application.conf-path

    /opt/deploy/spark/bin/spark-submit \
    --class com.mycompany.Main \
    --master yarn \
    --deploy-mode cluster \
    --files $ROOT_DIR/application.conf \
    --files $LOG4J_FULL_PATH/log4j.xml \
    --conf spark.executor.extraClassPath="-Dconfig.file=file:application.conf" \
    --driver-class-path $ROOT_DIR/application.conf \
    --verbose \
    /opt/deploy/lal-ml.jar

The exception I receive is:

2016-11-09 12:32:14 ERROR ApplicationMaster:95 - User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'com'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'com'
    at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
    at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
    at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:218)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:224)
    at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:33)
    at com.mycompany.Main$.main(Main.scala:36)
    at com.mycompany.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)

So my question is: does anybody know how I can load an external Typesafe application.conf file that sits on my local machine when using spark-submit with YARN?

I tried following some of the solutions in How to add a typesafe config file which is located on HDFS to spark-submit (cluster-mode)?, in Typesafe Config in Spark, and in How to pass -D parameter or environment variable to Spark job?, but nothing worked.

I'd appreciate any direction toward solving this.

Thanks in advance

asked Nov 09 '16 by Gideon

2 Answers

So with a little digging in the Spark 1.6.1 source code I found the solution.

These are the steps you need to take in order to get both log4j and application.conf used by your application when submitting to YARN in cluster mode:

  • When passing several files, as I was doing with both application.conf and log4j.xml, you need to submit them with a single --files argument, separating the paths with a comma, like this: --files "$ROOT_DIR/application.conf,$LOG4J_FULL_PATH/log4j.xml"
  • That's it for the application.conf; there's no need for the extraJavaOpts for it (contrary to what was written in my question). The issue is that Spark was using only the last --files argument passed, which is why only log4j.xml was being shipped. To get log4j.xml picked up I also had to take the following step.
  • Add another line to the spark-submit command like this: --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:log4j.xml". Notice that once you pass a file with --files, you can refer to it by file name alone, without any path. The combined command is sketched after this list.
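
Putting those steps together, here is a sketch of the corrected spark-submit command (reconstructed from the steps above, so treat it as illustrative rather than verbatim):

    LOG4J_FULL_PATH=/log4j-path
    ROOT_DIR=/application.conf-path

    /opt/deploy/spark/bin/spark-submit \
    --class com.mycompany.Main \
    --master yarn \
    --deploy-mode cluster \
    --files "$ROOT_DIR/application.conf,$LOG4J_FULL_PATH/log4j.xml" \
    --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:log4j.xml" \
    --driver-class-path $ROOT_DIR/application.conf \
    --verbose \
    /opt/deploy/lal-ml.jar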

Note: I haven't tried it, but from what I saw, if you're trying to run in client mode I think the spark.driver.extraJavaOptions setting should be replaced with something like --driver-java-options. That's it. So simple, and I wish these things were documented better. I hope this answer helps someone.
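
In client mode that would look something like this (untested, as noted above):

    --driver-java-options "-Dlog4j.configuration=file:log4j.xml"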

Cheers

answered by Gideon


Even though this question is from a year ago, I had a similar issue with ConfigFactory. To be able to read the application.conf file, you have to do two things:

  • Submit the file to the driver. This is done with --files /path/to/file/application.conf. Note that you can read it from HDFS if you wish.
  • Submit the com.typesafe:config package. This is done with --packages com.typesafe:config:version.

Since the application.conf file will be in the same temporary directory as the main application jar, your code can assume it is in the working directory and read it with a relative path.
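
For example, something along these lines should work (a sketch; ConfigFactory.parseFile with a relative path is just one way to read a file that is not on the classpath, and the com.mycompany namespace is taken from the question):

    import java.io.File
    import com.typesafe.config.ConfigFactory

    // application.conf was shipped with --files, so it sits in the
    // container's working directory next to the application jar
    val config = ConfigFactory
      .parseFile(new File("application.conf"))
      .resolve()
      .getConfig("com.mycompany")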

Using the answer given above (https://stackoverflow.com/a/40586476/6615465), the command for this question becomes the following:

    LOG4J_FULL_PATH=/log4j-path
    ROOT_DIR=/application.conf-path

    /opt/deploy/spark/bin/spark-submit \
    --packages com.typesafe:config:1.3.2 \
    --class com.mycompany.Main \
    --master yarn \
    --deploy-mode cluster \
    --files "$ROOT_DIR/application.conf,$LOG4J_FULL_PATH/log4j.xml" \
    --conf spark.executor.extraClassPath="-Dconfig.file=file:application.conf" \
    --driver-class-path $ROOT_DIR/application.conf \
    --verbose \
    /opt/deploy/lal-ml.jar

answered by Antonio Méndez