
Dataproc does not unpack files passed as Archive

I'm trying to submit a .NET Spark job to Dataproc.

The command line looks like:

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    -- find

This command should run find to show the files in the current working directory.

But I see only these 2 files:

././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc

So GCP does not unpack the archive from Cloud Storage that was specified via --archives. The specified file exists, and the path was copied from the GCP UI. I also tried running an assembly file from inside the archive directly (the file does exist in the archive), but that, understandably, fails with File does not exist.
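
For reference, one way to sanity-check the archive's contents is to copy it locally and list it (using the gs:// path from the command above):

gsutil cp gs://bucket/dotnet-build-output.zip .
unzip -l dotnet-build-output.zip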

asked Nov 07 '22 by deeptowncitizen

1 Answer

I think the problem is that your command ran in the Spark driver, which ran on the master node, because Dataproc runs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
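
For example, the question's command with cluster deploy mode added would look like this (a sketch; the cluster, region, and bucket placeholders are taken from the question):

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --properties=spark.submit.deployMode=cluster \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    -- find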

According to the usage help of the --archives flag:

 --archives=[ARCHIVE,...]
   Comma separated list of archives to be extracted into the working
   directory of each executor. Must be one of the following file formats:
   .zip, .tar, .tar.gz, or .tgz.

The archive is copied to both the driver and executor directories, but it is only extracted for executors. I tested submitting a job with --archives=gs://my-bucket/foo.zip, which includes the 2 files foo.txt and deps.txt, and then I could find the extracted files on the worker nodes:

my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/

total 4
-r-x------ 1 yarn yarn 11 Jul  2 22:09 deps.txt
-r-x------ 1 yarn yarn  0 Jul  2 22:09 foo.txt
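
For completeness, a minimal sketch of how such a test archive could be created and uploaded (file names taken from the listing above; the bucket path is an assumption):

# Create two small test files and zip them up
printf 'dummy deps' > deps.txt
touch foo.txt
zip foo.zip foo.txt deps.txt
# Upload the archive so it can be passed via --archives
gsutil cp foo.zip gs://my-bucket/foo.zip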
answered Nov 15 '22 by Dagang