Spark pyspark vs spark-submit

The documentation on spark-submit says the following:

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster.

Regarding the pyspark it says the following:

You can also use bin/pyspark to launch an interactive Python shell.

This question may sound stupid, but when I run commands through pyspark, do they also run on the "cluster"? They do not run on the master node only, right?

asked Nov 29 '22 by Denys

2 Answers

There is no practical difference between these two. If not configured otherwise, both execute code in local mode. If a master is configured (either via the --master command-line parameter or the spark.master configuration property), the corresponding cluster will be used to execute the program.
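For example, both tools accept the same master configuration (a sketch; the master URL and the script name `my_app.py` are placeholders):

```shell
# No master configured: both run in local mode, i.e. driver and
# executors share a single JVM on the current machine.
spark-submit my_app.py
pyspark

# Master configured: both distribute work across the cluster.
spark-submit --master yarn my_app.py
pyspark --master spark://master-host:7077

# Equivalently, spark.master can be set via --conf
# or in conf/spark-defaults.conf:
spark-submit --conf spark.master=yarn my_app.py
```

So an interactive pyspark shell started with --master behaves like a submitted application: the shell is only the driver front end, and the work runs on the cluster's executors.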

answered Dec 05 '22 by zero323


If you are using EMR, there are three ways to run an application:

  1. using pyspark (or spark-shell)
  2. using spark-submit without --master and --deploy-mode
  3. using spark-submit with --master and --deploy-mode

Although all three will run the application on the Spark cluster, there is a difference in how the driver program works.

  • In the 1st and 2nd the driver runs in client mode, whereas in the 3rd the driver also runs inside the cluster.
  • In the 1st and 2nd you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel.
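The difference above can be sketched with two spark-submit invocations (the script name and the S3 path are hypothetical):

```shell
# Client mode (the default): the driver runs on the EMR master node,
# inside the spark-submit process, for the application's whole lifetime.
spark-submit my_app.py

# Cluster mode: the driver runs inside a YARN container on the cluster,
# so the master node is not tied up hosting each driver and several
# applications can be submitted concurrently.
spark-submit --master yarn --deploy-mode cluster s3://my-bucket/my_app.py
```

Note that pyspark and spark-shell cannot use --deploy-mode cluster at all: an interactive shell's driver must stay on the machine where the shell runs.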
answered Dec 05 '22 by braj