
Dataproc does not unpack files passed as Archive

I'm trying to submit a .NET Spark job to Dataproc.

The command line looks like:

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    -- find

This command should run find to show the files in the current working directory.

But I see only these 2 files:

././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc

So GCP does not unpack the archive from Cloud Storage that was specified via --archives. The specified file exists, and the path was copied from the GCP UI. I also tried running an assembly file from inside the archive directly (the file does exist in the archive), but that, understandably, fails with File does not exist.
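
For reference, one way to sanity-check the archive's contents is to copy it locally and list it (using the gs:// path from the command above):

gsutil cp gs://bucket/dotnet-build-output.zip .
unzip -l dotnet-build-output.zip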

asked Nov 07 '22 by deeptowncitizen

1 Answer

I think the problem is that your command ran in the Spark driver, which ran on the master node, because Dataproc runs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
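
For example, the question's command with cluster deploy mode added would look like this (a sketch; the cluster, region, and bucket placeholders are taken from the question):

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --properties=spark.submit.deployMode=cluster \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    -- find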

According to the usage help of the --archives flag:

 --archives=[ARCHIVE,...]
   Comma separated list of archives to be extracted into the working
   directory of each executor. Must be one of the following file formats:
   .zip, .tar, .tar.gz, or .tgz.

The archive is copied to both the driver and executor directories, but it is only extracted for executors. I tested submitting a job with --archives=gs://my-bucket/foo.zip, which includes the 2 files foo.txt and deps.txt, and then I could find the extracted files on the worker nodes:

my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/

total 4
-r-x------ 1 yarn yarn 11 Jul  2 22:09 deps.txt
-r-x------ 1 yarn yarn  0 Jul  2 22:09 foo.txt
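
For completeness, a minimal sketch of how such a test archive could be created and uploaded (file names taken from the listing above; the bucket path is an assumption):

# Create two small test files and zip them up
printf 'dummy deps' > deps.txt
touch foo.txt
zip foo.zip foo.txt deps.txt
# Upload the archive so it can be passed via --archives
gsutil cp foo.zip gs://my-bucket/foo.zip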
answered Nov 15 '22 by Dagang