I'm trying to submit a .NET Spark job to Dataproc.
The command line looks like:
gcloud dataproc jobs submit spark \
--cluster=<cluster> \
--region=<region> \
--class=org.apache.spark.deploy.dotnet.DotnetRunner \
--jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
--archives=gs://bucket/dotnet-build-output.zip \
-- find
This command should invoke the find command to list the files in the current directory.
And I see only 2 files:
././microsoft-spark-2.4.x-0.11.0.jar
././microsoft-spark-2.4.x-0.11.0.jar.crc
So it appears that GCP does not extract the archive from Cloud Storage specified via --archives. The specified file exists, and its path was copied from the GCP UI. I also tried to run an assembly file from inside the archive directly (it does exist there), but it fails, reasonably enough, with File does not exist.
I think the problem is that your command ran in the Spark driver, which runs on the master node, because Dataproc runs jobs in client mode by default. You can change this by adding --properties spark.submit.deployMode=cluster when submitting the job.
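A minimal sketch of the modified submit command, reusing the placeholders from the question (cluster, region, and bucket paths are the asker's, not verified here):

gcloud dataproc jobs submit spark \
    --cluster=<cluster> \
    --region=<region> \
    --class=org.apache.spark.deploy.dotnet.DotnetRunner \
    --jars=gs://bucket/microsoft-spark-2.4.x-0.11.0.jar \
    --archives=gs://bucket/dotnet-build-output.zip \
    --properties=spark.submit.deployMode=cluster \
    -- find

In cluster mode the driver runs in a YARN container on a worker node, so the archive should be extracted into its working directory just like for the executors.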
According to the usage help of the --archives flag:
--archives=[ARCHIVE,...] Comma separated list of archives to be extracted into the working directory of each executor. Must be one of the following file formats: .zip, .tar, .tar.gz, or .tgz.
The archive will be copied to both the driver and executor working directories, but it will only be extracted for executors. I tested submitting a job with --archives=gs://my-bucket/foo.zip, which includes 2 files, foo.txt and deps.txt, and afterwards I could find the extracted files on the worker nodes:
my-cluster-w-0:~$ sudo ls -l /hadoop/yarn/nm-local-dir/usercache/root/filecache/40/foo.zip/
total 4
-r-x------ 1 yarn yarn 11 Jul 2 22:09 deps.txt
-r-x------ 1 yarn yarn 0 Jul 2 22:09 foo.txt
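For reference, a sketch of how such a test can be reproduced; the bucket, cluster, and region names are placeholders, and using the SparkPi example class is an assumption, not part of the original test:

# Create a small archive and upload it to a hypothetical bucket.
echo "hello" > foo.txt
echo "world" > deps.txt
zip foo.zip foo.txt deps.txt
gsutil cp foo.zip gs://my-bucket/foo.zip

# Submit any Spark job with the archive attached; the SparkPi
# example jar ships with Dataproc clusters under /usr/lib/spark.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --archives=gs://my-bucket/foo.zip \
    -- 10

The extracted contents can then be inspected under the YARN NodeManager local dirs on a worker node, as shown above.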