It seems that by default EMR deploys the Spark driver to one of the CORE nodes, leaving the MASTER node virtually unutilized. Is it possible to run the driver program on the MASTER node instead? I have experimented with the --deploy-mode argument to no avail.
Here is my instance groups JSON definition:
[
  {
    "InstanceGroupType": "MASTER",
    "InstanceCount": 1,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Master"
  },
  {
    "InstanceGroupType": "CORE",
    "InstanceCount": 3,
    "InstanceType": "m3.xlarge",
    "Name": "Spark Executors"
  }
]
Here is my configurations JSON definition:
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    },
    "Configurations": []
  },
  {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {},
        "Configurations": []
      }
    ]
  }
]
Here is my steps JSON definition:
[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]
I am using aws emr create-cluster with --release-label emr-4.3.0.
I don't think it is a waste. When running Spark on EMR, the master node runs the YARN ResourceManager, the Livy server, and possibly other applications you selected. If you run in client mode, the driver program runs on the master node as well.
Note that the driver program can be heavier than the tasks on the executors, for example when it collects all results from all executors. In that case, allocate enough resources to the master node if that is where the driver program runs.
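For illustration, here is a minimal sketch of the question's steps definition with client mode requested explicitly. It assumes the Args of a SPARK-type step are passed through to spark-submit, so that --deploy-mode client takes effect; the step name, class name, and JAR path are the question's own. Whether this changes driver placement on emr-4.3.0 has not been verified here, and the question reports that experimenting with --deploy-mode did not help.

```json
[
  {
    "Name": "example",
    "Type": "SPARK",
    "Args": [
      "--deploy-mode", "client",
      "--class", "com.name.of.Class",
      "/home/hadoop/myjar-assembly-1.0.jar"
    ],
    "ActionOnFailure": "TERMINATE_CLUSTER"
  }
]
```

In client mode the JAR is read from the local filesystem of the node that launches spark-submit, so the path above would need to exist on the master node.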