How to connect to Spark EMR from the locally running Spark Shell

Tags:

apache-spark

I have created a Spark EMR cluster. I would like to execute jobs either on my localhost or on the EMR cluster.

Assuming I run spark-shell on my local computer, how can I tell it to connect to the Spark EMR cluster? What would be the exact configuration options and/or commands to run?

asked Nov 08 '22 by Datageek

1 Answer

It looks like others have also failed at this and ended up running the Spark driver on EMR instead, making use of e.g. Zeppelin or Jupyter running on EMR.

Setting up our own machines as Spark drivers that connected to the core nodes on EMR would have been ideal. Unfortunately, this proved impossible, and we gave up after trying many configuration changes. The driver would start up and then wait indefinitely, trying unsuccessfully to connect to the worker nodes.

Most of our Spark development is in pyspark, using Jupyter Notebook as our IDE. Since we had to run Jupyter from the master node, we couldn't risk losing our work if the cluster went down. So we created an EBS volume, attached it to the master node, and placed all of our work on this volume. [...]

source
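For context, a typical connection attempt looks something like the sketch below. EMR runs Spark on YARN rather than in standalone mode, so a local spark-shell needs the cluster's Hadoop/YARN configuration files copied down from the master node; the hostname, key path, and config directory here are all hypothetical. Even when the shell reaches the ResourceManager this way, it tends to hang as described in the quote, because in client mode the cluster has to open connections back to the driver, and EMR's private subnets and security groups usually block that return path.

    # Hypothetical hostname/paths: copy the cluster configuration from the EMR master.
    mkdir -p ~/emr-conf
    scp -i ~/mykey.pem hadoop@ec2-xx-xx-xx-xx.compute.amazonaws.com:/etc/hadoop/conf/* ~/emr-conf/

    # Point the local Spark installation at those configs and submit to YARN.
    export HADOOP_CONF_DIR=~/emr-conf
    spark-shell --master yarn --deploy-mode client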
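The EBS setup described in the quote can be sketched with the AWS CLI roughly as follows; the volume ID, instance ID, device name, and mount point are hypothetical, and the volume has to be created in the same availability zone as the master node.

    # Hypothetical IDs; the volume must be in the master node's availability zone.
    aws ec2 create-volume --size 100 --volume-type gp2 --availability-zone us-east-1a
    aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
        --instance-id i-0123456789abcdef0 --device /dev/xvdf

    # Then, on the master node: format the volume, mount it, and keep notebooks there.
    sudo mkfs -t xfs /dev/xvdf
    sudo mkdir -p /mnt/notebooks
    sudo mount /dev/xvdf /mnt/notebooks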

Note: If you go down this route, I would consider using S3 for storing notebooks; then you don't have to manage EBS volumes.
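A minimal version of the S3 approach is to sync the notebook directory to a bucket (the bucket name below is hypothetical), either by hand or from a cron job; tools like s3contents can also make Jupyter store notebooks in S3 directly.

    # Hypothetical bucket: push notebooks to S3, and pull them back when needed.
    aws s3 sync /mnt/notebooks s3://my-spark-notebooks/notebooks/
    aws s3 sync s3://my-spark-notebooks/notebooks/ /mnt/notebooks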

answered Nov 15 '22 by m01