
What to specify as Spark master when running on Amazon EMR

EMR has native support for Spark. When using the EMR web interface to create a new cluster, it is possible to add a custom step that executes a Spark application when the cluster starts, essentially an automated spark-submit after cluster startup.
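For reference, the CLI equivalent of such a step would look roughly like this (the cluster id, main class, and jar location below are just placeholders):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps Type=Spark,Name=MyApp,ActionOnFailure=CONTINUE,Args=[--class,com.example.MyApp,s3://my-bucket/myApp.jar]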

I've been wondering how to specify the master node in the SparkConf within the application, given that the EMR cluster is started and the jar file is submitted through the designated EMR step.

It is not possible to know the IP of the cluster master beforehand, as it would be if I started the cluster manually and then built that information into my application before calling spark-submit.

Code snippet:

SparkConf conf = new SparkConf().setAppName("myApp").setMaster("spark://???:7077");
JavaSparkContext sparkContext = new JavaSparkContext(conf);

Note that I am asking about the "cluster" execution mode, so the driver program runs on the cluster as well.

asked Mar 11 '23 by user3209815

1 Answer

Short answer: don't.

Longer answer: A master URL like "spark://..." is for Spark Standalone, but EMR uses Spark on YARN, so the master URL should be just "yarn". This is already configured for you in spark-defaults.conf, so when you run spark-submit, you don't even have to include "--master ...".
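As a minimal sketch (the app name is just a placeholder), the SparkConf on EMR can therefore either omit setMaster() entirely or set it to "yarn":

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// On EMR the cluster manager is YARN, so no host or port is needed.
// Omitting setMaster() also works, since spark-defaults.conf already sets the master.
SparkConf conf = new SparkConf().setAppName("myApp").setMaster("yarn");
JavaSparkContext sparkContext = new JavaSparkContext(conf);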

However, since you are asking about cluster execution mode (actually, it's called "deploy mode"), you may specify either "--master yarn-cluster" (deprecated) or "--deploy-mode cluster" (preferred). This will make the Spark driver run on a random cluster node rather than on the EMR master node.
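For example, a submit command along these lines (class and jar names are placeholders; --master can be omitted because spark-defaults.conf already provides it) runs the driver on one of the cluster's nodes:

spark-submit --deploy-mode cluster --class com.example.MyApp s3://my-bucket/myApp.jar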

answered Mar 13 '23 by Jonathan Kelly