Running Spark on AWS EMR, how to run driver on master node?

It seems that by default EMR deploys the Spark driver to one of the CORE nodes, leaving the MASTER node virtually idle. Is it possible to run the driver program on the MASTER node instead? I have experimented with the --deploy-mode argument to no avail.

Here is my instance groups JSON definition:

[
  {
    "InstanceGroupType": "MASTER",
    "InstanceCount": 1,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Master"
  },
  {
    "InstanceGroupType": "CORE",
    "InstanceCount": 3,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Executors"
  }
]

Here is my configurations JSON definition:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    },
    "Configurations": []
  },
  {
    "Classification": "spark-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
        },
        "Configurations": [
        ]
      }
    ]
  }
]

Here is my steps JSON definition:

[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]

I am using aws emr create-cluster with --release-label emr-4.3.0.
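
For reference, the cluster is created with a command along these lines (the file names are illustrative):

aws emr create-cluster \
  --release-label emr-4.3.0 \
  --instance-groups file://instance-groups.json \
  --configurations file://configurations.json \
  --steps file://steps.json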

asked Feb 04 '16 by Landon Kuhn


1 Answer

I don't think the master node is wasted. When running Spark on EMR, the master node runs the YARN ResourceManager, the Livy server, and possibly other applications you selected. And if you submit in client mode, most of the driver program runs on the master node as well.
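To make the driver's placement explicit, a minimal sketch is to add --deploy-mode client to the step arguments; a SPARK-type step passes its Args through to spark-submit, and steps execute on the master node, so in client mode the driver ends up there:

[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--deploy-mode", "client",
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]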

Note that the driver program can be heavier than the tasks on the executors, e.g. when it collects all results from every executor, in which case you need to allocate enough resources to the master node if that is where the driver program is running.
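
One way to size the driver is through the spark-defaults classification; the 8g value below is only an illustration and should be matched to the master instance type:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "8g"
    }
  }
]

Keep in mind that the maximizeResourceAllocation setting from the question already computes a spark.driver.memory from the cluster's instance types, so an explicit value like this should take precedence over it.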

answered Sep 28 '22 by Z.Wei