In Submitting Applications in the Spark docs, as of 1.6.0 and earlier, it's not clear how to specify the --jars argument, as it's apparently not a colon-separated classpath nor a directory expansion.
The docs say "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes."
Question: What are all the options for submitting a classpath with --jars in the spark-submit script in $SPARK_HOME/bin? Anything undocumented that could be submitted as an improvement for docs?
I ask because when I was testing --jars today, we had to explicitly provide a path to each jar:
/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData --jars=local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar /usr/local/spark/jars/thold-0.0.1-1.jar
We are choosing to pre-populate the cluster with all the jars in /usr/local/spark/jars on each worker. It seemed that if no local:/, file:/, or hdfs:// scheme was supplied, the default is file:/, and the driver makes the jars available via a web server it runs. I chose local:, as above.
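For reference, a minimal sketch of the three URL schemes mentioned above; the jar names, class name, and HDFS address here are hypothetical placeholders, not the exact jars from the command above:

# Sketch of the URL schemes --jars accepts (hypothetical paths):
#   local:/ - the jar is expected to already exist at that path on every node; nothing is copied.
#   file:/  - the jar is served to executors from a file server run by the driver.
#   hdfs:// - the jar is fetched from HDFS by the driver and executors.
spark-submit --class com.example.Main \
  --jars local:/usr/local/spark/jars/dep-a.jar,file:/home/me/libs/dep-b.jar,hdfs://namenode:8020/libs/dep-c.jar \
  /usr/local/spark/jars/app.jar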
And it seems that we do not need to put the main jar in the --jars argument. I have not tested yet whether other classes in the final argument (the application-jar argument per the docs, i.e. /usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or whether I need to put the application-jar in the --jars path to get classes not named after --class to be seen.
(And granted, with Spark standalone mode using --deploy-mode client, you also have to put a copy of the driver on each worker, but you don't know up front which worker will run the driver.)
You can also add jars using the spark-submit option --jars; with this option you can add a single jar or multiple jars as a comma-separated list.
When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with --jars.
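To make that concrete, here is a minimal sketch (the class name and jar paths are hypothetical): an explicit comma-separated list works, while pointing --jars at a directory does not, since directory expansion is not performed.

# Works: an explicit, comma-separated list of jar URLs (hypothetical paths).
spark-submit --class com.example.Main \
  --jars /opt/libs/dep-a.jar,/opt/libs/dep-b.jar \
  /opt/app/app.jar

# Does not work as a shortcut: --jars will not expand a directory into the jars it contains.
spark-submit --class com.example.Main \
  --jars /opt/libs/ \
  /opt/app/app.jar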
Once you do a spark-submit, a driver program is launched; it requests resources from the cluster manager, and at the same time the main program of the user application is started by the driver program.
This way it worked easily, instead of specifying each jar with its version separately:
#!/bin/sh
# Build a comma-separated list of all dependent jars in OTHER_JARS,
# skipping the application jar, which is passed separately.
JARS=$(find ../lib -name '*.jar')
OTHER_JARS=""
for eachjarinlib in $JARS ; do
  if [ "$(basename "$eachjarinlib")" != "APPLICATIONJARTOBEADDEDSEPARATELY.JAR" ]; then
    # Prepend with a comma only when the list is non-empty, to avoid a trailing comma.
    if [ -z "$OTHER_JARS" ]; then
      OTHER_JARS=$eachjarinlib
    else
      OTHER_JARS=$eachjarinlib,$OTHER_JARS
    fi
  fi
done
echo "--- final list of jars is: $OTHER_JARS"
echo "$CLASSPATH"
spark-submit --verbose --class <yourclass> \
  ... OTHER OPTIONS \
  --jars $OTHER_JARS,APPLICATIONJARTOBEADDEDSEPARATELY.JAR
The tr unix command can also help, as in the example below:
--jars $(echo /dir_of_jars/*.jar | tr ' ' ',')
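For instance, a minimal sketch of a full invocation using that substitution (the class name and application jar path are hypothetical placeholders):

# Turn every jar under /dir_of_jars into a comma-separated list for --jars
# (com.example.Main and /path/to/app.jar are hypothetical).
spark-submit --class com.example.Main \
  --jars $(echo /dir_of_jars/*.jar | tr ' ' ',') \
  /path/to/app.jar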