
Spark on YARN and the --archives option

I am trying to use the --archives option available in Spark on YARN to upload an archive file. Based on the documentation, and as mentioned in this question, YARN should not only upload the zip file but also automatically unarchive it on the worker nodes.

From the logs, I can see that YARN is uploading the zip file to Spark's staging directory, e.g.

17/09/19 01:28:57 INFO Client: Uploading resource file:/home/foo/bar/zoo.zip -> hdfs://abc.foo.bar:8020/user/xyz/.sparkStaging/application_1503584958553_4501/zoo.zip

The issue I am facing is that, although the zip file is copied into the Spark staging directory, it is not automatically unarchived, and I am guessing it is also not copied to the worker nodes.

Assuming YARN does unarchive the zip file, is there a way to access its location on the worker nodes programmatically?

I am running Spark 2.2 on EMR 5.8, which ships with YARN 2.7.

Pawan Mishra asked Oct 17 '22 05:10


1 Answer

To unarchive the zip into a directory of your choice, append the directory name to the archive path after a #:

--archives src.zip#src

This means that src.zip will be uploaded to all executors and unarchived into a "src" directory in each executor's working directory. Another example to make it clearer:

--archives src.zip#abc

If you change the directory name (the string after #) as above, src.zip will be unarchived into the "abc" directory instead.
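
To reach the unarchived files programmatically, a relative path starting with the alias is usually enough, since the alias is created in the working directory of each YARN container. Below is a minimal Python sketch under that assumption. The helper name list_archive_files and its base_dir parameter are hypothetical, introduced only for illustration, and the Spark call in the trailing comment assumes a PySpark job submitted with --archives src.zip#src.

```python
import os


def list_archive_files(alias, base_dir=None):
    """Return sorted file paths (relative to the alias) under an unpacked archive.

    alias    -- the name after '#' in --archives, e.g. "src" (hypothetical helper)
    base_dir -- directory that contains the alias; defaults to the current
                working directory, assumed here to be the YARN container's
                working directory when called on an executor
    """
    root = os.path.join(base_dir or os.getcwd(), alias)
    if not os.path.isdir(root):
        raise FileNotFoundError(f"archive alias not found: {root}")

    found = []
    # followlinks=True also descends into symlinked subdirectories
    for dirpath, _dirs, files in os.walk(root, followlinks=True):
        for name in files:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


# Assumed usage inside a PySpark job (not executed here):
#   sc.parallelize(range(2), 2) \
#     .mapPartitions(lambda _: [list_archive_files("src")]) \
#     .collect()
```

If the helper raises FileNotFoundError on the driver but not on the executors, that may simply mean the driver is not running inside a YARN container (for example in client deploy mode), so the alias would only exist on the worker side.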

mental_matrix answered Oct 21 '22 07:10