
Can I add arguments to python code when I submit spark job?

I'm trying to use spark-submit to execute my Python code on a Spark cluster.

Generally, we run spark-submit with Python code like this:

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  my_python_code.py \
  1000

But I want to run my_python_code.py while passing several arguments. Is there a smart way to pass arguments?

asked Aug 26 '15 by Jinho Yoo

People also ask

What happens when you submit spark job?

When you do a spark-submit, a driver program is launched. The driver requests resources from the cluster manager and, at the same time, starts the main program of the user's application.

How do you pass arguments in spark shell?

To make this more systematic: put the code below in a script (e.g. spark-script.sh), and then you can simply run ./spark-script.sh your_file.scala first_arg second_arg third_arg and have an Array[String] called args containing your arguments.

Can you explain what happens internally when we submit a spark job using spark submit?

Spark translates the RDD transformations into a DAG (Directed Acyclic Graph) and starts the execution. At a high level, when any action is called on the RDD, Spark creates the DAG and submits it to the DAG scheduler.
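
To make the lazy-evaluation point concrete, here is a minimal PySpark sketch (the app name and data are illustrative assumptions, not from the answer): the map and filter calls only record steps in the DAG, and nothing executes until the collect action fires.

from pyspark.sql import SparkSession

# Minimal sketch: transformations are lazy and only extend the DAG.
spark = SparkSession.builder.appName("dag-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)                      # transformation: recorded, not executed
multiples_of_4 = doubled.filter(lambda x: x % 4 == 0)   # transformation: still nothing runs

# Action: Spark now builds the DAG and submits it to the DAG scheduler.
print(multiples_of_4.collect())

spark.stop()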

How do I submit a Python code in spark?

Spark Submit Python File: the Apache Spark binary comes with a spark-submit.sh script file for Linux and Mac, and a spark-submit.cmd command file for Windows. These scripts live in the $SPARK_HOME/bin directory and are used to submit a PySpark file with the .py extension (Spark with Python) to the cluster.


2 Answers

Even though sys.argv is a good solution, I still prefer this more proper way of handling command-line args in my PySpark jobs:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", help="some useful description.")
args = parser.parse_args()
if args.ngrams:
    ngrams = args.ngrams

This way, you can launch your job as follows:

spark-submit job.py --ngrams 3 

More information about the argparse module can be found in the Argparse Tutorial.
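
For context, a complete job built around this pattern might look like the sketch below. Only the --ngrams flag comes from the answer above; the SparkSession setup, sample data, and n-gram counting logic are illustrative assumptions.

# job.py -- a minimal sketch combining argparse with a PySpark job.
import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", type=int, default=2, help="n-gram size to compute.")
args = parser.parse_args()

spark = SparkSession.builder.appName("ngrams-job").getOrCreate()

words = ["to", "be", "or", "not", "to", "be"]
# Build n-grams of the requested size, then count them with Spark.
grams = [tuple(words[i:i + args.ngrams]) for i in range(len(words) - args.ngrams + 1)]
counts = spark.sparkContext.parallelize(grams).map(lambda g: (g, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())

spark.stop()

Submitted as spark-submit job.py --ngrams 3, args.ngrams would be 3.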

answered Oct 21 '22 by noleto


Yes. Put this in a file called args.py:

import sys
print(sys.argv)

If you run

spark-submit args.py a b c d e  

You will see:

['/spark/args.py', 'a', 'b', 'c', 'd', 'e'] 
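
Note that sys.argv[0] is the script path itself, so the real arguments start at index 1. A minimal sketch of picking them out (the variable names are illustrative):

import sys

# sys.argv[0] is the path to the script; user arguments start at index 1.
script_path = sys.argv[0]
user_args = sys.argv[1:]

print("script:", script_path)
print("arguments:", user_args)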
answered Oct 21 '22 by Paul