I have containerized my ML job code, written in Python, into a Docker container and I am able to run it as a Docker service using Amazon ECS. I would like to run it in a distributed way using Spark (PySpark) and deploy it on Amazon EMR. Can I establish a connection between ECS and EMR?
The steps for configuring an EC2 instance (or, in this case, a Docker container) to submit jobs to the EMR cluster are outlined below.
To submit Spark jobs to an EMR cluster from a remote machine, the following must be true:
Network traffic is allowed from the remote machine to all cluster nodes.
All Spark and Hadoop binaries are installed on the remote machine.
The configuration files on the remote machine point to the EMR cluster.
Confirm that network traffic is allowed from the remote machine to all cluster nodes
If you are using an EC2 instance as a remote machine or edge node: allow inbound traffic from that instance's security group to the security groups for each cluster node.
If you are using your own machine: allow inbound traffic from your machine's IP address to the security groups for each cluster node.
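For example, here is a minimal sketch using the AWS CLI, where sg-0123456789abcdef0 stands for one of the cluster node security groups and sg-0fedcba9876543210 for the remote instance's security group (both IDs are placeholders); repeat it for each cluster node security group and narrow the port range if you prefer:
# allow TCP traffic from the remote instance's security group into a cluster node security group
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 0-65535 --source-group sg-0fedcba9876543210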
Install the Spark and other dependent binaries on the remote machine
To install the binaries, copy the files from the EMR cluster's master node, as explained in the following steps. This is the easiest way to be sure that the same versions are installed on both the EMR cluster and the remote machine.
Choose an appropriate Docker base image, such as the amazonlinux:2 image available at https://hub.docker.com/_/amazonlinux.
Copy the following files from the EMR cluster's master node to the Docker image. Don't change the folder structure or file names.
/etc/yum.repos.d/emr-apps.repo
/var/aws/emr/repoPublicKey.txt
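As a sketch, assuming you have SSH access to the master node as the hadoop user with a key pair (mykey.pem and the master's public DNS name below are placeholders), the two files can be pulled into your Docker build context with scp and then added to the image at the same paths:
# copy the EMR yum repository definition and its public key from the master node
scp -i mykey.pem hadoop@<emr-master-public-dns>:/etc/yum.repos.d/emr-apps.repo .
scp -i mykey.pem hadoop@<emr-master-public-dns>:/var/aws/emr/repoPublicKey.txt .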
With the repository files in place, install the Hadoop and Spark client packages:
sudo yum install -y hadoop-client
sudo yum install -y hadoop-hdfs
sudo yum install -y spark-core
sudo yum install -y java-1.8.0-openjdk
If any of these installs fail with a repository error, copy the remaining repo files under /etc/yum.repos.d/ from the EMR master and re-run them.
Create the following directories:
sudo mkdir -p /var/aws/emr/
sudo mkdir -p /etc/hadoop/conf
sudo mkdir -p /etc/spark/conf
sudo mkdir -p /var/log/spark/user/
sudo chmod 777 -R /var/log/spark/
Copy the following configuration files from the EMR master node to the Docker container so that they point to the EMR cluster. Keep the same paths:
EMR master --> Docker container
/etc/spark/conf --> /etc/spark/conf
/etc/hadoop/conf/ --> /etc/hadoop/conf/
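Again as a hedged sketch with placeholder key and host names, the directories can be copied recursively with scp and then placed at the same paths inside the image or container:
# copy the Spark and Hadoop client configuration from the master node
scp -r -i mykey.pem hadoop@<emr-master-public-dns>:/etc/spark/conf ./spark-conf
scp -r -i mykey.pem hadoop@<emr-master-public-dns>:/etc/hadoop/conf ./hadoop-conf
# then add spark-conf as /etc/spark/conf and hadoop-conf as /etc/hadoop/conf in the Docker image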
Create an HDFS home directory for the user that will submit the Spark jobs (sparkuser in this example; replace it with your own user):
hdfs dfs -mkdir /user/sparkuser
hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
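As a quick sanity check from inside the container, you can list the new HDFS directory and the YARN nodes; if these commands return results from the cluster, the network rules and configuration files are working (the expected output described in the comments assumes a healthy setup):
# should show the sparkuser directory on the cluster's HDFS
hdfs dfs -ls /user/
# should list the cluster's core and task nodes
yarn node -list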
At this point, if you followed the steps above, you should be able to run the following spark-submit command from inside the Docker container, and it will execute on your EMR cluster.
spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar
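Since the original job is written in Python, the equivalent PySpark submission would look like the following sketch, where my_ml_job.py is a hypothetical placeholder for your containerized job's entry point:
# submit the Python job to YARN on the EMR cluster; add --py-files <archive> if the job has local module dependencies
spark-submit --master yarn --deploy-mode cluster my_ml_job.py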