In the book Spark in Action, I am reading this:
“If you’re submitting your application in cluster-deploy mode using the spark-submit script, the JAR file you specify needs to be available on the worker (at the location you specified) that will be executing the application. Because there’s no way to say in advance which worker will execute your driver, you should put your application’s JAR file on all the workers if you intend to use cluster-deploy mode, or you can put your application’s JAR file on HDFS and use the HDFS URL as the JAR filename.”
But in the official documentation I see this:
1 - If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.
2 - If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2). To control the application’s configuration or execution environment, see Spark Configuration.
What am I missing here? How does it work? Do I need to deploy my assembly jar all over the cluster (except for the master node)?
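For context, a minimal sketch of the submission the documentation describes, passing an assembly jar to bin/spark-submit (the class name and paths are placeholders, not taken from the question):

# Hypothetical example: submitting an assembly jar built with the sbt or Maven
# assembly plugin; com.example.MyApp and the paths are placeholders.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  /path/to/my-app-assembly-1.0.jar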
The accepted answer from maasg (as well as Ajit's answer) assumes that you are submitting to a YARN master. If that is the case, then indeed, your application jar will be provided to the cluster (via HDFS?) automatically.
However, if you are submitting to a Standalone master, and your deploy mode is cluster, then Spark does nothing to distribute your application jar.
The lack of this distinction in the official documentation is rather frustrating. The only place I've ever seen this mentioned is in a git commit comment for fixing SPARK-2260:
One thing that may or may not be an issue is that the jars must be available on the driver node. In standalone-cluster mode, this effectively means these jars must be available on all the worker machines, since the driver is launched on one of them. The semantics here are not the same as yarn-cluster mode, where all the relevant jars are uploaded to a distributed cache automatically and shipped to the containers. This is probably not a concern, but still worth a mention.
TLDR: For a Standalone master, listen to "Spark in Action". For YARN, you won't have this problem.
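A hedged sketch of what that means for a Standalone master in cluster deploy mode: the jar path you pass must resolve on whichever worker ends up hosting the driver, so the file has to be pre-copied to that same path on every worker (the class name and paths below are placeholders):

# Assumption: the assembly jar has already been copied to this exact path on
# every worker, e.g. with scp or a configuration management tool.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --deploy-mode cluster \
  /opt/spark-apps/my-app-assembly-1.0.jar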
If you have HDFS defined for your cluster, there is no need to copy your application jar to all the nodes. If your cluster has no HDFS support, then you do need to copy your application jars to all of your workers/slaves yourself, at the same path on each machine.
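A minimal sketch of the HDFS variant, assuming the cluster has HDFS available (the hdfs:// path and class name are placeholders):

# Assumption: the jar was first uploaded to HDFS, e.g.
#   hdfs dfs -put my-app-assembly-1.0.jar /apps/spark/
# so any worker chosen to run the driver can fetch it from there.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --deploy-mode cluster \
  hdfs://namenode:8020/apps/spark/my-app-assembly-1.0.jar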
The official documentation is correct (as we would expect).
TL;DR: There is no need to copy application files or dependencies across the cluster to submit a Spark job with spark-submit.

spark-submit takes care of delivering the application jar to the executors. Even more, the jar files specified using the --jars option are also served by the file server on the driver program to all executors, so we don't need to copy any dependencies to the executors, either. Spark takes care of that for you.

Further details are available on the Advanced Dependency Management page.
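A minimal sketch of that --jars usage, with placeholder class and jar names; the extra jars are comma-delimited, as the documentation quoted in the question says:

# Hypothetical example: extra dependency jars passed with --jars are served to
# the executors by the driver's file server; all names are placeholders.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --jars /local/libs/dep1.jar,/local/libs/dep2.jar \
  /path/to/my-app-assembly-1.0.jar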
As you are running your job in cluster deployment mode, the dependent JARs specified through --jars will be copied from the local path to the containers via HDFS.
Following is the console output, where you can see that the application JAR (mapRedQA-1.0.0.jar), along with the required configuration (__spark_conf__5743283277173703345.zip), is uploaded to the application's staging directory on HDFS, which is accessible to all executor nodes. That's why you don't need to put the application JAR on the worker nodes; Spark will take care of it.
17/08/10 11:42:55 INFO yarn.Client: Preparing resources for our AM container
17/08/10 11:42:57 INFO yarn.YarnSparkHadoopUtil: getting token for namenode: hdfs://master.localdomain:8020/user/user1/.sparkStaging/application_1502271179925_0001
17/08/10 11:43:19 INFO hdfs.DFSClient: Created token for user1: HDFS_DELEGATION_TOKEN [email protected], renewer=yarn, realUser=, issueDate=1502379778376, maxDate=1502984578376, sequenceNumber=6144, masterKeyId=243 on 2.10.1.70:8020
17/08/10 11:43:25 INFO yarn.Client: Uploading resource file:/Automation/mapRedQA-1.0.0.jar -> hdfs://master.localdomain:8020/user/user1/.sparkStaging/application_1502271179925_0001/mapRedQA-1.0.0.jar
17/08/10 11:43:51 INFO yarn.Client: Uploading resource file:/tmp/spark-f4e913eb-17d5-4d5b-bf99-c8212715ceaa/__spark_conf__5743283277173703345.zip -> hdfs://master.localdomain:8020/user/user1/.sparkStaging/application_1502271179925_0001/__spark_conf__5743283277173703345.zip
17/08/10 11:43:52 INFO spark.SecurityManager: Changing view acls to: user1
17/08/10 11:43:52 INFO spark.SecurityManager: Changing modify acls to: user1
17/08/10 11:43:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user1); users with modify permissions: Set(user1)
17/08/10 11:43:53 INFO yarn.Client: Submitting application 1 to ResourceManager
17/08/10 11:43:58 INFO impl.YarnClientImpl: Application submission is not finished, submitted application application_1502271179925_0001 is still in NEW
t: Application report for application_1502271179925_0001 (state: ACCEPTED)
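For completeness, a hedged sketch of the kind of command that produces output like the above; the jar path matches the one uploaded in the log, but the main class name is a placeholder:

# Placeholder main class; only the jar path is taken from the log above.
./bin/spark-submit \
  --class com.example.MapRedQA \
  --master yarn \
  --deploy-mode cluster \
  /Automation/mapRedQA-1.0.0.jar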