I have created a Spark EMR cluster. I would like to execute jobs either on my localhost or EMR cluster.
Assuming I run spark-shell on my local computer how can I tell it to connect to the Spark EMR cluster, what would be the exact configuration options and/or commands to run.
It looks like others have also failed at this and ended up running the Spark driver on EMR, but then making use of e.g. Zeppelin or Jupyter running on EMR.
Setting up our own machines as spark drivers that connected to the core nodes on EMR would have been ideal. Unfortunately, this was impossible to do and we forfeited after trying many configuration changes. The driver would start up and then keep waiting unsuccessfully, trying to connect to the slaves.
Most of our Spark development is on pyspark using Jupyter Notebook as our IDE. Since we had to run Jupyter from the master node, we couldn’t risk losing our work if the cluster were to go down. So, we created an EBS volume and attached it to the master node and placed all of our work on this volume. [...]
source
Note: If you go down this route, I would consider using S3 for storing notebooks, then you don't have to manage EBS volumes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With