 

Running EMR job from ECS Docker container

I have containerized an ML job written in Python into a Docker image and am able to run it as a service on Amazon ECS. I would like to run it in a distributed way using Spark (PySpark) and deploy it on Amazon EMR. Can I establish a connection between ECS and EMR?

asked May 25 '17 by akshat thakar

1 Answer

Configuring services:

Steps for configuring EC2 instance to submit to the EMR cluster can be found here:

  • https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/

A shortened version of the link above, adapted for Docker running on AWS:

To submit Spark jobs to an EMR cluster from a remote machine, the following must be true:

  1. Network traffic is allowed from the remote machine to all cluster nodes.

  2. All Spark and Hadoop binaries are installed on the remote machine.

  3. The configuration files on the remote machine point to the EMR cluster.

Resolution

Confirm that network traffic is allowed from the remote machine to all cluster nodes

  • If you are using an EC2 instance as the remote machine or edge node: allow inbound traffic from that instance's security group to the security groups for each cluster node.
  • If you are using your own machine: allow inbound traffic from your machine's IP address to the security groups for each cluster node.
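For example, from a machine with the AWS CLI configured, the rule for the EC2/ECS case can be added like this. This is only a sketch: the security-group IDs are placeholders, and the wide-open TCP port range is just a starting point you should narrow down.

# sg-EMR-NODE = security group attached to the EMR master/core nodes (placeholder)
# sg-ECS-TASK = security group used by the ECS task or container instance (placeholder)
aws ec2 authorize-security-group-ingress \
    --group-id sg-EMR-NODE \
    --protocol tcp \
    --port 0-65535 \
    --source-group sg-ECS-TASK

Repeat the rule for each security group the cluster uses (master and core/task nodes).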

Installing binaries

To install the binaries, copy the files from the EMR cluster's master node, as explained in the following steps. This is the easiest way to be sure that the same version is installed on both the EMR cluster and the remote machine.

  1. Choose an appropriate Docker base image.

    • For the Docker base image, I suggest the official amazonlinux:2 image, which can be found at https://hub.docker.com/_/amazonlinux.
  2. Copy the following files from the EMR cluster's master node into the Docker image. Don't change the folder structure or file names.

/etc/yum.repos.d/emr-apps.repo
/var/aws/emr/repoPublicKey.txt
  • These files give access to the AWS software repositories (one way to copy them off the master node is sketched below).
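For instance, plain scp can pull those two files into the Docker build context. This is only a sketch: <master-dns> and emr-key.pem are placeholders for your cluster's public DNS name and EC2 key pair, and reading /var/aws/emr/repoPublicKey.txt may require sudo depending on its permissions.

# run on a machine with SSH access to the EMR master node;
# emr-artifacts/ becomes part of the Docker build context
mkdir -p emr-artifacts
scp -i emr-key.pem hadoop@<master-dns>:/etc/yum.repos.d/emr-apps.repo emr-artifacts/
scp -i emr-key.pem hadoop@<master-dns>:/var/aws/emr/repoPublicKey.txt emr-artifacts/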
  3. Run the following commands in the Docker image to install the Spark and Hadoop binaries:
sudo yum install -y hadoop-client
sudo yum install -y hadoop-hdfs
sudo yum install -y spark-core
sudo yum install -y java-1.8.0-openjdk
  • If any of the above fail, try copying /etc/yum.repos.d/ from the EMR master node and re-run them.
  4. Run the following commands to create the folder structure in the Docker image:
sudo mkdir -p /var/aws/emr/
sudo mkdir -p /etc/hadoop/conf
sudo mkdir -p /etc/spark/conf
sudo mkdir -p /var/log/spark/user/
sudo chmod 777 -R /var/log/spark/

  5. Transfer the configs from the EMR master node to your Docker container (a Dockerfile sketch pulling steps 1-5 together follows the mapping below):
EMR master             Docker container
/etc/spark/conf/   --> /etc/spark/conf/
/etc/hadoop/conf/  --> /etc/hadoop/conf/
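Putting steps 1-5 together, a Dockerfile along the following lines is one way to bake all of this into the image. This is a sketch only: it assumes the repo files were copied into emr-artifacts/ as above, and that the contents of /etc/spark/conf and /etc/hadoop/conf from the master node were copied into emr-artifacts/spark-conf/ and emr-artifacts/hadoop-conf/ (for example with scp -r).

# Step 1: Amazon Linux 2 base image
FROM amazonlinux:2

# Step 2: EMR yum repository definition and public key, copied from the master node
COPY emr-artifacts/emr-apps.repo /etc/yum.repos.d/emr-apps.repo
COPY emr-artifacts/repoPublicKey.txt /var/aws/emr/repoPublicKey.txt

# Step 3: Spark/Hadoop client binaries and Java (no sudo needed; the build runs as root)
RUN yum install -y hadoop-client hadoop-hdfs spark-core java-1.8.0-openjdk && \
    yum clean all

# Step 4: folder structure expected by the Spark and Hadoop clients
RUN mkdir -p /var/aws/emr/ /etc/hadoop/conf /etc/spark/conf /var/log/spark/user/ && \
    chmod -R 777 /var/log/spark/

# Step 5: cluster configuration copied from the EMR master
COPY emr-artifacts/spark-conf/ /etc/spark/conf/
COPY emr-artifacts/hadoop-conf/ /etc/hadoop/conf/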
  6. At runtime, create the HDFS home directory for the user who will submit the Spark job to the EMR cluster. In the following commands, replace sparkuser with the name of your user.
hdfs dfs -mkdir /user/sparkuser
hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
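If the container should do this on start-up, a small entrypoint script is one option; entrypoint.sh and sparkuser are illustrative names, not something mandated by EMR.

#!/bin/bash
# entrypoint.sh (illustrative): prepare the HDFS home directory, then hand off
# to whatever command the container was started with (e.g. spark-submit ...).
set -e
hdfs dfs -mkdir -p /user/sparkuser
hdfs dfs -chown sparkuser:sparkuser /user/sparkuser
exec "$@"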

At this point, if you followed the steps, you should be able to run the following command from inside the Docker container, and it will execute on your EMR cluster:

spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar
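Since the job in the question is written in Python, the PySpark equivalent looks like this; my_ml_job.py and deps.zip are placeholders for your own script and its bundled dependencies.

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --py-files deps.zip \
    my_ml_job.py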
answered Oct 01 '22 by maksim