I am trying to spin up a cluster in AWS using EMR w/ Spark. I have a single bash bootstrap script to install some python packages, download credentials, and apply some configuration. The bootstrap action succeeds on the master, but fails on the slave. The only hint of an error is "i-#####: failed to start. bootstrap action 2 failed with non-zero exit code". The message immediately preceding it is "i-#####: bootstrap action 1 completed". (In both cases referring to the instance ID of the slave. The master is also reporting success of bootstrap action 1).
So it looks like the last command to execute in bootstrap action 2 is having an error and causing the bootstrap script to return a non-zero exit code. However, I only configured one bootstrap action. Do non-master nodes have another bootstrap action running automatically?
None of the logs show what the actual error is. I've looked at the bootstrap logs on S3 (which do not reliably show up) and tried tailing the /var/log/bootstrap-actions/ logs on the slave and master during startup.
I am pretty sure the error is not in my script (says every developer ever...). I am able to create a vanilla EMR cluster w/ no bootstrapping, then login once it is waiting and run my bootstrap script as user hadoop w/ no errors. I have also reviewed the last few commands (a grep and an echo) and verified that they do not return a non-zero exit nor cause the script to return a non-zero exit code.
I think the problem must be in some mystery second bootstrap action. Is this the case? How do I determine the error?
UPDATE
I logged into the slave node during boot up. I found the bootstrap action in /emr/instance-controller/lib/bootstrap-actions
. There was only 1 sub folder and it contained my bootstrap script. I then ran
tail -f /emr/instance-controller/log/instance-controller.log
. I verified that my script launched. After about 15 cycles of status checks (15 minutes), I see
2017-06-02 13:44:30,173 INFO InstanceConfigurer: Script 1 - Execution succeeded
Then I see another AWS script starting up and this seems to be the one that fails.
2017-06-02 13:44:30,181 INFO InstanceConfigurer: Running provision-node, with id 5aed1c54-4210-4387-944a-4fdbbce6dc8d
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Fetching file '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - startExec '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,189 INFO InstanceConfigurer: startExec '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,190 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Environment:
...
2017-06-02 13:44:54,201 INFO InstanceConfigurer: Output from command '/var/lib/aws/emr/provision-node':
stdout:
stderr:
2017-06-02 13:44:54,202 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node
2017-06-02 13:44:54,202 INFO InstanceConfigurer: waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node
2017-06-02 13:44:54,203 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - total process run time: 24 seconds
2017-06-02 13:44:54,203 INFO InstanceConfigurer: total process run time: 24 seconds
2017-06-02 13:44:54,217 ERROR InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Execution for /var/lib/aws/emr/provision-node failed with code '255'
2017-06-02 13:44:54,219 ERROR InstanceConfigurer: Startup failed with
aws157.instancecontroller.common.model.InstanceConfiguratorException: Source: PROVISION_NODE | ErrorCode: SCRIPT_EXECUTION_FAILED_CODE | Execution for /var/lib/aws/emr/provision-node failed with code '255'
at aws157.instancecontroller.common.InstanceConfigurator.runScript(InstanceConfigurator.java:563)
at aws157.instancecontroller.common.InstanceConfigurator.provisionNode(InstanceConfigurator.java:225)
at aws157.instancecontroller.common.InstanceConfigurator.doDistributionConfigure(InstanceConfigurator.java:201)
at aws157.instancecontroller.common.InstanceConfigurator.access$200(InstanceConfigurator.java:70)
at aws157.instancecontroller.common.InstanceConfigurator$1.run(InstanceConfigurator.java:251)
I'm not familiar with that /var/lib/aws/emr/provision-node
script, but its sole contents are
#!/bin/bash
set -ex
sudo /usr/share/aws/emr/node-provisioner/bin/provision-node "$@"
Looking at /usr/share/aws/emr/node-provisioner/bin/provision-node
, i can see that this script does a bunch of work to identify the path to $EMR_NODE_PROVISIONER_HOME and then run the following Java class from there
java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop _UUID_
I figured this out by viewing the source of the provision-node script and running it independently. I haven't been able to catch the log or failure in real-time to see what goes wrong. When I ran it separately, I got the following exception. but I think this is because I passed garbage data instead of the UUID (I don't know where the UUID comes from and it is different on the launch of each slave).
2017-06-02 14:55:13,593 ERROR main: Encountered a problem while provisioning
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31)
at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32)
at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51)
at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21)
at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20)
So my question now is what is com.amazonaws.emr.node.provisioner.Program and why is it failing (or how do I figure out why?)?
UPDATE 2
I managed to tail the outputs of /usr/share/aws/emr/node-provisioner/bin/provision-node all the way to failure and the result is the same as in my independent running above.
java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop
2017-06-02 17:05:37,869 ERROR main: Encountered a problem while provisioning
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38)
at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31)
at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32)
at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51)
at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21)
at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20)
I'm guessing this might be a firewall/security group issue, but I'm using the default security groups generated by EMR so I would expect the ports to be open. I am building this cluster in a private subnet of a VPC so that may be an issue. However, when I build a cluster with no bootstrapping, I don't get this failure. My next debug step is to build a vanilla cluster w/o bootstrapping and watch for this same command.
UPDATE 3 Confirmed that a vanilla EMR w/ Spark deployment succeeds w/ no network changes. No errors in /usr/share/aws/emr/node-provisioner/bin/provision-node. After launching the java command, the next line of stderr shows a JSON dump of platform config parameters. However, stdout shows a yum package install from repo Bigtop. I don't see a yum command in the script or stderr output (from set -xe) so I think the yum commands must be within that Java program. Not sure why they succeed here but not w/ a bootstrap action.
My private VPC does have an S3 endpoint w/ subnet routing and firewall rules allowing access to the endpoint plist. My bootstrap script is able to successfully install packages using yum (not from Bigtop repo), copy files from S3, and download code from an external git repo on the internet.
I ran into this problem myself and here is what I have figured out. The bootstrap action 2 refers to the node provisioning, which fails due to a misstep done in custom bootstrap action 1. For example in my case, I was installing a MySQL server with version X and then during node provisioning, the execution fails since yum was trying to install a different version of MySQL server. It seems conflicting yum updates are one major source of the problem.
The messaging is very unclear and there is no good documentation either.
If you run into a similar situation, go through all the logs of your cluster. In my case, I found the problematic message in the following logfile
s3://aws-logs-<account-id>-us-west-2/elasticmapreduce/<cluster-id>/node/<instance-id>/provision-node/apps-phase/0/93d592d9-d7e8-47a8-9972-05922c5a9885/stderr.gz
My bootstrap script was running a yum update. When I comment that out, I am able to get past the provision-node script and eventually get the cluster to a waiting state. One of the updates must have been creating some kind of conflict or other problem. I don't know which one. For now, I'm just going to avoid running yum update.
Here is the yum log. I'm guessing it is not one of the R or mysql packages. Perhaps java, kernel, aws, or util-linux?
Installed:
kernel.x86_64 0:4.9.27-14.31.amzn1
Updated:
R.x86_64 0:3.3.3-1.51.amzn1
R-core.x86_64 0:3.3.3-1.51.amzn1
R-core-devel.x86_64 0:3.3.3-1.51.amzn1
R-devel.x86_64 0:3.3.3-1.51.amzn1
R-java.x86_64 0:3.3.3-1.51.amzn1
R-java-devel.x86_64 0:3.3.3-1.51.amzn1
aws-amitools-ec2.noarch 0:1.5.13-0.2.amzn1
aws-cli.noarch 0:1.11.83-1.46.amzn1
java-1.8.0-openjdk.x86_64 1:1.8.0.131-2.b11.30.amzn1
java-1.8.0-openjdk-devel.x86_64 1:1.8.0.131-2.b11.30.amzn1
java-1.8.0-openjdk-headless.x86_64 1:1.8.0.131-2.b11.30.amzn1
libRmath.x86_64 0:3.3.3-1.51.amzn1
libRmath-devel.x86_64 0:3.3.3-1.51.amzn1
libblkid.x86_64 0:2.23.2-33.28.amzn1
libmount.x86_64 0:2.23.2-33.28.amzn1
libuuid.x86_64 0:2.23.2-33.28.amzn1
mysql-config.x86_64 0:5.5.56-1.17.amzn1
mysql55.x86_64 0:5.5.56-1.17.amzn1
mysql55-devel.x86_64 0:5.5.56-1.17.amzn1
mysql55-libs.x86_64 0:5.5.56-1.17.amzn1
ntp.x86_64 0:4.2.6p5-44.34.amzn1
ntpdate.x86_64 0:4.2.6p5-44.34.amzn1
python27-botocore.noarch 0:1.5.46-1.63.amzn1
python27-jmespath.noarch 0:0.9.2-1.12.amzn1
util-linux.x86_64 0:2.23.2-33.28.amzn1
Further insights welcome. Otherwise, moving on to actually getting my code running.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With