I am trying to spin up an EMR cluster with fair scheduling so that I can run multiple steps in parallel. I see that this is possible via Data Pipeline (https://aws.amazon.com/about-aws/whats-new/2015/06/run-parallel-hadoop-jobs-on-your-amazon-emr-cluster-using-aws-data-pipeline/), but I already have cluster creation and management automated via an Airflow job that calls the AWS CLI, so it would be great to just update my configuration:
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge
I think this may be achievable using the --configurations flag (https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html), but I'm not sure of the correct property names.
Yes, you are correct. You can use EMR configurations to achieve this. Create a JSON file along these lines:
yarn-config.json:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  }
]
as per the Hadoop Fair Scheduler docs.
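If shipping a separate JSON file from the Airflow job is inconvenient, the --configurations flag should also accept the same JSON passed inline (single-quoted so the shell leaves it untouched). A minimal sketch, where ${OTHER_CREATE_CLUSTER_FLAGS} is just a stand-in for the flags you already pass:
aws emr create-cluster ${OTHER_CREATE_CLUSTER_FLAGS} \
--configurations '[{"Classification":"yarn-site","Properties":{"yarn.resourcemanager.scheduler.class":"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"}}]'
Either form works; the full command below uses the file:// form.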
Then modify your AWS CLI call as follows:
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge \
--configurations file://yarn-config.json
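Once the cluster is up, a quick spot-check (assuming standard EMR paths; fill in your own cluster id, key, and master DNS) is to read the configurations back via the CLI, or grep the rendered yarn-site.xml on the master node:
aws emr describe-cluster --cluster-id <your-cluster-id> --query 'Cluster.Configurations'
ssh -i <your-key.pem> hadoop@<master-public-dns> \
    'grep -A1 yarn.resourcemanager.scheduler.class /etc/hadoop/conf/yarn-site.xml'
The ResourceManager web UI's Scheduler page should then show fair queues rather than the default capacity queues.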