 

Passing command line arguments to Spark-shell

Tags:

apache-spark

I have a Spark job written in Scala. I use

spark-shell -i <file-name>

to run the job. I need to pass a command-line argument to the job. Right now, I invoke the script through a Linux task, where I do

export INPUT_DATE=2015/04/27 

and access the value inside the job through the environment variable:

System.getenv("INPUT_DATE")

Is there a better way to handle the command line arguments in Spark-shell?

asked Apr 28 '15 by Jeevs



2 Answers

My solution is to use a custom key to define the arguments instead of spark.driver.extraJavaOptions, in case you someday pass in a value that could interfere with the JVM's behavior.

spark-shell -i your_script.scala --conf spark.driver.args="arg1 arg2 arg3"

You can access the arguments from within your scala code like this:

val args = sc.getConf.get("spark.driver.args").split("\\s+")
args: Array[String] = Array(arg1, arg2, arg3)
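
For illustration, a hypothetical your_script.scala consuming these arguments could look like this (a sketch; the argument names and meanings are assumptions, not from the question):

// Run with:
//   spark-shell -i your_script.scala --conf spark.driver.args="2015/04/27 daily /tmp/out"
// spark-shell already provides `sc`, so no setup is needed here.
val args = sc.getConf.get("spark.driver.args").split("\\s+")
val Array(inputDate, mode, outputPath) = args  // illustrative: expects exactly three arguments
println(s"date=$inputDate, mode=$mode, output=$outputPath")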
answered Sep 17 '22 by soulmachine


Short answer:

spark-shell -i <(echo val theDate = $INPUT_DATE ; cat <file-name>)

Long answer:

This solution causes the following line to be added at the beginning of the file before it is passed to spark-shell:

val theDate = ...

thereby defining a new variable. The way this is done (the <( ... ) syntax) is called process substitution. It is available in Bash. See this question for more on this, and for alternatives (e.g. mkfifo) for non-Bash environments.
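
Concretely, if the echo quotes the value (a small adjustment to the command above, since an unquoted 2015/04/27 would not parse as the string you probably want), the stream spark-shell receives starts like this:

// Line prepended by the process substitution, assuming the echo is written as
//   echo "val theDate = \"$INPUT_DATE\""
val theDate = "2015/04/27"
// ...followed, unchanged, by the original contents of <file-name>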

Making this more systematic:

Put the code below in a script (e.g. spark-script.sh), and then you can simply run:

./spark-script.sh your_file.scala first_arg second_arg third_arg

and have an Array[String] called args available with your arguments.

The file spark-script.sh:

#!/usr/bin/env bash
# First argument: the Scala file to run. The remaining arguments are
# forwarded to the script as a whitespace-separated list.
scala_file=$1

shift 1

arguments="$@"

#set +o posix  # to enable process substitution when not running on bash

# Prepend a line defining `args` as an Array[String], then feed the combined
# stream to spark-shell via process substitution.
spark-shell --master yarn --deploy-mode client \
    --queue default \
    --driver-memory 2G --executor-memory 4G \
    --num-executors 10 \
    -i <(echo 'val args = "'$arguments'".split("\\s+")' ; cat "$scala_file")
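
And a hypothetical your_file.scala run through this wrapper could use the injected array like this (a sketch; the argument handling is illustrative):

// By the time this runs, the line prepended by spark-script.sh has already
// defined `args: Array[String]`.
require(args.nonEmpty, "expected at least one argument, e.g. a date")
val inputDate = args(0)
println(s"Running job for $inputDate with ${args.length} argument(s)")
// ...rest of the Spark job, using the spark-shell-provided `sc` / `spark`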
answered Sep 17 '22 by Amir