I want to run a shell script on Dataproc which will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script.
Currently these scripts run on AWS with the help of script-runner.jar. I am not sure how to move this to Dataproc. Is there anything similar available for Dataproc?
Or will I have to change all my scripts and calculate the arguments in Pig with the help of pig sh or pig fs?
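For reference, a simplified sketch of the kind of wrapper I mean (script, path, and parameter names here are illustrative, not my actual code):
# run_pig.sh -- computes arguments at runtime, then passes them to a Pig script via -param
RUN_DATE=$(date +%Y-%m-%d)
INPUT_PATH="s3://my-bucket/raw/${RUN_DATE}"
pig -param INPUT="${INPUT_PATH}" -param RUN_DATE="${RUN_DATE}" my_script.pig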
As Aniket mentions, pig sh would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script in itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script hello.sh:
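For illustration, hello.sh could be as trivial as the following placeholder; in practice it would be the wrapper that computes your Pig arguments:
#!/bin/bash
# hello.sh -- placeholder script for illustration; substitute your own argument-computing wrapper
echo "hello from $(hostname)"
Then upload it and run it: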
gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
-e 'fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh'
The pig fs command uses Hadoop paths, so when copying your script from GCS you must specify the destination as file:/// to make sure it ends up on the local filesystem instead of HDFS; the sh commands afterwards reference the local filesystem automatically, so you don't use file:/// there.
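To make that distinction concrete, here is the same -e command string broken out line by line with Pig-style comments (bucket name illustrative):
fs -cp -f gs://my-bucket/hello.sh file:///tmp/hello.sh  -- Hadoop-style path; file:/// forces the local filesystem rather than HDFS
sh chmod 750 /tmp/hello.sh                              -- sh already targets the local filesystem, so a plain path is enough
sh /tmp/hello.sh                                        -- run the now-local script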
Alternatively, you can take advantage of the way --jars works to automatically stage a file into the temporary directory created just for your Pig job, rather than explicitly copying from GCS into a local directory; you simply specify your shell script itself as a --jars argument:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
Or:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars gs://${BUCKET}/hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
In these cases, the script would only temporarily be downloaded into a directory that looks like /tmp/59bc732cd0b542b5b9dcc63f112aeca3, and which exists only for the lifetime of the Pig job.
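Tying this back to the original question: the staged script itself is the natural place to compute your dynamic arguments and hand them to Pig, for example via -param. A minimal sketch, assuming hypothetical script and parameter names:
# hello.sh -- computes arguments on the cluster at runtime, then launches the real Pig script with them
RUN_DATE=$(date +%Y-%m-%d)
gsutil cp gs://my-bucket/pig/my_script.pig /tmp/my_script.pig
pig -param RUN_DATE="${RUN_DATE}" -f /tmp/my_script.pig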
There is no shell job type in Dataproc at the moment. As an alternative, you can use a Pig job with the sh command, which forks your shell script; that script can then (again) run your Pig job. (You can use a PySpark job similarly if you prefer Python.) For example:
# cat a.sh
HELLO=hello
pig -e "sh echo $HELLO"
# pig -e "sh $PWD/a.sh"
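A sketch of how such a wrapper could be submitted to Dataproc, reusing the --jars staging shown above (bucket and cluster names are placeholders):
gsutil cp a.sh gs://${BUCKET}/a.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars gs://${BUCKET}/a.sh \
-e 'sh chmod 750 ${PWD}/a.sh; sh ${PWD}/a.sh'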