 

Run Bash script on GCP Dataproc

I want to run a shell script on Dataproc which will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script.

Currently these scripts run on AWS with the help of script-runner.jar. I am not sure how to move this setup to Dataproc. Is there anything similar available for Dataproc?

Or will I have to change all my scripts and calculate the arguments in Pig with the help of pig sh or pig fs?
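
For reference, a simplified sketch of one of these wrappers (the Pig script name and parameter here are placeholders):

#!/bin/bash
# Compute a dynamic argument at runtime, then pass it to Pig.
RUN_DATE=$(date -d "yesterday" +%Y-%m-%d)
pig -param run_date="${RUN_DATE}" daily_job.pig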

Asked Dec 31 '22 by Foram Shah

2 Answers

As Aniket mentions, pig sh would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script hello.sh:

gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    -e 'fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh'

The pig fs command uses Hadoop paths, so to copy your script from GCS you must give the destination an explicit file:/// scheme to make sure it lands on the local filesystem instead of HDFS; the sh commands afterwards reference the local filesystem automatically, so you don't use file:/// there.
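
To see why the scheme matters, compare the two destinations below (bucket name is a placeholder); on Dataproc the default filesystem is HDFS, so the first copy lands in HDFS, where a later sh command could not execute the script, while the second lands on the node's local disk:

pig -e 'fs -cp -f gs://my-bucket/hello.sh /tmp/hello.sh'
pig -e 'fs -cp -f gs://my-bucket/hello.sh file:///tmp/hello.sh'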

Alternatively, you can take advantage of the way --jars works: it automatically stages a file into the temporary directory created just for your Pig job, so rather than explicitly copying from GCS into a local directory, you simply specify your shell script itself as a --jars argument:

gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars hello.sh \
    -e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'

Or:

gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars gs://${BUCKET}/hello.sh \
    -e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'

In these cases, the script is downloaded into a staging directory that looks like /tmp/59bc732cd0b542b5b9dcc63f112aeca3 and exists only for the lifetime of the Pig job.
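
Tying this back to the original question, the staged script itself can calculate the dynamic arguments and then invoke Pig. A sketch of such a hello.sh, where the Pig script path and parameter name are hypothetical:

#!/bin/bash
# Calculate an argument at runtime, then run the real Pig job with it.
NUM_REDUCERS=$(( $(nproc) * 2 ))
pig -param reducers="${NUM_REDUCERS}" /tmp/main_job.pig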

Answered Mar 15 '23 by Dennis Huo

There is no shell job type in Dataproc at the moment. As an alternative, you can use a Pig job with the sh command to fork your shell script, which can in turn run your Pig job. (You can use PySpark similarly if you prefer Python.) For example:

# cat a.sh  (wrapper: computes a value, then runs Pig with it)
HELLO=hello
pig -e "sh echo $HELLO"

# pig -e "sh $PWD/a.sh"
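
To run a.sh on the cluster rather than locally, the same --jars staging trick from the other answer applies:

gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
    --jars a.sh \
    -e 'sh chmod 750 ${PWD}/a.sh; sh ${PWD}/a.sh'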
Answered Mar 15 '23 by Aniket Mokashi