I am trying to use the --archives option available in Spark on YARN to upload an archive file. According to the documentation, and as mentioned in this question, YARN will not only upload the zip file but also automatically unarchive it on the worker nodes.
From the logs, I can see that YARN uploads the zip file to Spark's staging directory, e.g.
17/09/19 01:28:57 INFO Client: Uploading resource file:/home/foo/bar/zoo.zip -> hdfs://abc.foo.bar:8020/user/xyz/.sparkStaging/application_1503584958553_4501/zoo.zip
The issue I am facing is that, although the zip file is copied into the Spark staging directory, it is not automatically unarchived there, and I am guessing it is also not copied to the worker nodes.
Assuming YARN does unarchive the zip file, is there a way to access the unarchived location on the worker nodes programmatically?
I am running Spark 2.2 on EMR 5.8, which comes with YARN 2.7.
To unarchive the zip into a directory of your choice, pass the archive in the following form:
--archives src.zip#src
This means that src.zip will be uploaded to all executors and unarchived into a directory named "src". Another example to make it clearer:
--archives src.zip#abc
If you change the directory name (the string after #) as above, src.zip will instead be unarchived into a directory named "abc".
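To check this from code, here is a minimal Python sketch, not taken from the original answer. It assumes that YARN links the unarchived directory into each container's current working directory under the name after the #, and that when no # is given the archive's own file name (for example zoo.zip) is used as the directory name. Both function names are hypothetical helpers introduced only for illustration.

```python
import os
from urllib.parse import urlparse


def archive_link_name(archive_spec):
    """Return the directory name an --archives entry should appear under.

    Assumption: the text after '#' is the link name; without '#',
    the archive's own file name is used instead.
    """
    parsed = urlparse(archive_spec)
    if parsed.fragment:
        return parsed.fragment
    return os.path.basename(parsed.path)


def list_unarchived_files(link_name, base_dir=None):
    """List files under the unarchived directory, relative to its root.

    Assumption: on an executor the directory sits in the container's
    current working directory, so base_dir defaults to os.getcwd().
    """
    base = base_dir if base_dir is not None else os.getcwd()
    root = os.path.join(base, link_name)
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)
```

For the listing to reflect what the executors see, list_unarchived_files(archive_link_name("src.zip#abc")) would have to run inside a function executed on the executors (for example one passed to mapPartitions in a PySpark job), not on the driver. An empty result there would suggest the directory is not where this sketch assumes it is.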