I want to run a shell script on Dataproc which will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script.
Currently these scripts run on AWS with the help of script-runner.jar. I am not sure how to move this to Dataproc. Is there anything similar available for Dataproc?
Or will I have to change all my scripts and calculate the arguments in Pig with the help of pig sh or pig fs?
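For reference, a simplified sketch of the kind of wrapper I mean (script, path, and parameter names here are illustrative, not my actual code):
# run_pig.sh -- computes arguments at runtime, then passes them to a Pig script via -param
RUN_DATE=$(date +%Y-%m-%d)
INPUT_PATH="s3://my-bucket/raw/${RUN_DATE}"
pig -param INPUT="${INPUT_PATH}" -param RUN_DATE="${RUN_DATE}" my_script.pig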
As Aniket mentions, pig sh would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script in itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script hello.sh:
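For illustration, hello.sh could be as trivial as the following placeholder; in practice it would be the wrapper that computes your Pig arguments:
#!/bin/bash
# hello.sh -- placeholder script for illustration; substitute your own argument-computing wrapper
echo "hello from $(hostname)"
Then upload it and run it: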
gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
-e 'fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh'
The pig fs command uses Hadoop paths, so when copying your script from GCS you must specify the destination as file:/// to make sure it ends up on the local filesystem instead of HDFS; the sh commands afterwards reference the local filesystem automatically, so you don't use file:/// there.
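To make that distinction concrete, here is the same -e command string broken out line by line with Pig-style comments (bucket name illustrative):
fs -cp -f gs://my-bucket/hello.sh file:///tmp/hello.sh  -- Hadoop-style path; file:/// forces the local filesystem rather than HDFS
sh chmod 750 /tmp/hello.sh                              -- sh already targets the local filesystem, so a plain path is enough
sh /tmp/hello.sh                                        -- run the now-local script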
Alternatively, you can take advantage of the way --jars works to automatically stage a file into the temporary directory created just for your Pig job, rather than explicitly copying from GCS into a local directory; you simply specify your shell script itself as a --jars argument:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
Or:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars gs://${BUCKET}/hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
In these cases, the script would only temporarily be downloaded into a directory that looks like /tmp/59bc732cd0b542b5b9dcc63f112aeca3, and which exists only for the lifetime of the Pig job.
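Tying this back to the original question: the staged script itself is the natural place to compute your dynamic arguments and hand them to Pig, for example via -param. A minimal sketch, assuming hypothetical script and parameter names:
# hello.sh -- computes arguments on the cluster at runtime, then launches the real Pig script with them
RUN_DATE=$(date +%Y-%m-%d)
gsutil cp gs://my-bucket/pig/my_script.pig /tmp/my_script.pig
pig -param RUN_DATE="${RUN_DATE}" -f /tmp/my_script.pig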
There is no shell job type in Dataproc at the moment. As an alternative, you can use a Pig job with the sh command, which forks your shell script; that script can then (again) run your Pig job. (You can use a PySpark job similarly if you prefer Python.) For example:
# cat a.sh
HELLO=hello
pig -e "sh echo $HELLO"
# pig -e "sh $PWD/a.sh"
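A sketch of how such a wrapper could be submitted to Dataproc, reusing the --jars staging shown above (bucket and cluster names are placeholders):
gsutil cp a.sh gs://${BUCKET}/a.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars gs://${BUCKET}/a.sh \
-e 'sh chmod 750 ${PWD}/a.sh; sh ${PWD}/a.sh'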