I have a directory with some model files and my application has to access these models files in local file system due to some reason.
Of course I know that --files
option of spark-submit
can upload file to the working directory of each executor and it does work.
However, I want keep the directory structure of my files so I come up with --archives
option, which is said
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually use it to upload models.zip
, I found yarn just put it there without extraction, like what it did with --files
. Have I misunderstood to be extracted
or misused this option?
You can submit a Spark batch application by using cluster mode (default) or client mode either inside the cluster or from an external client: Cluster mode (default): Submitting Spark batch application and having the driver run on a host in your driver resource group. The spark-submit syntax is --deploy-mode cluster.
While a text file in GZip, BZip2, and other supported compression formats can be configured to be automatically decompressed in Apache Spark as long as it has the right file extension, you must perform additional steps to read zip files. The following notebooks show how to read zip files.
Spark Submit Python File Apache Spark binary comes with spark-submit.sh script file for Linux, Mac, and spark-submit. cmd command file for windows, these scripts are available at $SPARK_HOME/bin directory which is used to submit the PySpark file with . py extension (Spark with python) to the cluster.
Once you do a Spark submit, a driver program is launched and this requests for resources to the cluster manager and at the same time the main program of the user function of the user processing program is initiated by the driver program.
Found the answer myself.
YARN does extract the archive but add an extra folder with the same name of the archive. To make it clear, If I put models/model1
and models/models2
in models.zip
, then I have to access my models by models.zip/models/model1
and models.zip/models/model2
.
Moreover, we can make this more beautiful using the # syntax.
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
Edit:
This answer was tested on spark 2.0.0 and I'm not sure the behavior in other versions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With