Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

EMR slave bootstrap failure in node provisioner AFTER bootstrap action succeeds

I am trying to spin up a cluster in AWS using EMR w/ Spark. I have a single bash bootstrap script to install some python packages, download credentials, and apply some configuration. The bootstrap action succeeds on the master, but fails on the slave. The only hint of an error is "i-#####: failed to start. bootstrap action 2 failed with non-zero exit code". The message immediately preceding it is "i-#####: bootstrap action 1 completed". (In both cases referring to the instance ID of the slave. The master is also reporting success of bootstrap action 1).

So it looks like the last command to execute in bootstrap action 2 is having an error and causing the bootstrap script to return a non-zero exit code. However, I only configured one bootstrap action. Do non-master nodes have another bootstrap action running automatically?

None of the logs show what the actual error is. I've looked at the bootstrap logs on S3 (which do not reliably show up) and tried tailing the /var/log/bootstrap-actions/ logs on the slave and master during startup.

I am pretty sure the error is not in my script (says every developer ever...). I am able to create a vanilla EMR cluster w/ no bootstrapping, then login once it is waiting and run my bootstrap script as user hadoop w/ no errors. I have also reviewed the last few commands (a grep and an echo) and verified that they do not return a non-zero exit nor cause the script to return a non-zero exit code.

I think the problem must be in some mystery second bootstrap action. Is this the case? How do I determine the error?

UPDATE I logged into the slave node during boot up. I found the bootstrap action in /emr/instance-controller/lib/bootstrap-actions. There was only 1 sub folder and it contained my bootstrap script. I then ran tail -f /emr/instance-controller/log/instance-controller.log. I verified that my script launched. After about 15 cycles of status checks (15 minutes), I see

2017-06-02 13:44:30,173 INFO InstanceConfigurer: Script 1 - Execution succeeded

Then I see another AWS script starting up and this seems to be the one that fails.

2017-06-02 13:44:30,181 INFO InstanceConfigurer: Running provision-node, with id 5aed1c54-4210-4387-944a-4fdbbce6dc8d
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Fetching file '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,188 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - startExec '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,189 INFO InstanceConfigurer: startExec '/var/lib/aws/emr/provision-node'
2017-06-02 13:44:30,190 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Environment:
...
2017-06-02 13:44:54,201 INFO InstanceConfigurer: Output from command '/var/lib/aws/emr/provision-node':
stdout:
stderr:

2017-06-02 13:44:54,202 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node
2017-06-02 13:44:54,202 INFO InstanceConfigurer: waitProcessCompletion ended with exit code 255 : /var/lib/aws/emr/provision-node
2017-06-02 13:44:54,203 INFO InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - total process run time: 24 seconds
2017-06-02 13:44:54,203 INFO InstanceConfigurer: total process run time: 24 seconds
2017-06-02 13:44:54,217 ERROR InstanceConfigurer: Script 5aed1c54-4210-4387-944a-4fdbbce6dc8d - Execution for /var/lib/aws/emr/provision-node failed with code '255'
2017-06-02 13:44:54,219 ERROR InstanceConfigurer: Startup failed with
aws157.instancecontroller.common.model.InstanceConfiguratorException: Source: PROVISION_NODE | ErrorCode: SCRIPT_EXECUTION_FAILED_CODE | Execution for /var/lib/aws/emr/provision-node failed with code '255'
    at aws157.instancecontroller.common.InstanceConfigurator.runScript(InstanceConfigurator.java:563)
    at aws157.instancecontroller.common.InstanceConfigurator.provisionNode(InstanceConfigurator.java:225)
    at aws157.instancecontroller.common.InstanceConfigurator.doDistributionConfigure(InstanceConfigurator.java:201)
    at aws157.instancecontroller.common.InstanceConfigurator.access$200(InstanceConfigurator.java:70)
    at aws157.instancecontroller.common.InstanceConfigurator$1.run(InstanceConfigurator.java:251)

I'm not familiar with that /var/lib/aws/emr/provision-node script, but its sole contents are

#!/bin/bash
set -ex

sudo /usr/share/aws/emr/node-provisioner/bin/provision-node "$@"

Looking at /usr/share/aws/emr/node-provisioner/bin/provision-node, i can see that this script does a bunch of work to identify the path to $EMR_NODE_PROVISIONER_HOME and then run the following Java class from there

java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop _UUID_

I figured this out by viewing the source of the provision-node script and running it independently. I haven't been able to catch the log or failure in real-time to see what goes wrong. When I ran it separately, I got the following exception. but I think this is because I passed garbage data instead of the UUID (I don't know where the UUID comes from and it is different on the launch of each slave).

2017-06-02 14:55:13,593 ERROR main: Encountered a problem while provisioning
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
    at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49)
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38)
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31)
    at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32)
    at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51)
    at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21)
    at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20)

So my question now is what is com.amazonaws.emr.node.provisioner.Program and why is it failing (or how do I figure out why?)?

UPDATE 2

I managed to tail the outputs of /usr/share/aws/emr/node-provisioner/bin/provision-node all the way to failure and the result is the same as in my independent running above.

java -classpath '/usr/share/aws/emr/node-provisioner/lib/*' com.amazonaws.emr.node.provisioner.Program --phase hadoop
2017-06-02 17:05:37,869 ERROR main: Encountered a problem while provisioning
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
    at com.amazonaws.emr.node.provisioner.http.JsonHttpClient.doRequest(JsonHttpClient.java:49)
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:38)
    at com.amazonaws.emr.node.provisioner.platform.EmrPlatformClient.getConfiguration(EmrPlatformClient.java:31)
    at com.amazonaws.emr.node.provisioner.bigtop.config.PlatformContextProvider.provide(PlatformContextProvider.java:32)
    at com.amazonaws.emr.node.provisioner.phase.PhaseWorkflow.work(PhaseWorkflow.java:51)
    at com.amazonaws.emr.node.provisioner.phase.ProvisionHadoopPhase.perform(ProvisionHadoopPhase.java:21)
    at com.amazonaws.emr.node.provisioner.Program.main(Program.java:20)

I'm guessing this might be a firewall/security group issue, but I'm using the default security groups generated by EMR so I would expect the ports to be open. I am building this cluster in a private subnet of a VPC so that may be an issue. However, when I build a cluster with no bootstrapping, I don't get this failure. My next debug step is to build a vanilla cluster w/o bootstrapping and watch for this same command.

UPDATE 3 Confirmed that a vanilla EMR w/ Spark deployment succeeds w/ no network changes. No errors in /usr/share/aws/emr/node-provisioner/bin/provision-node. After launching the java command, the next line of stderr shows a JSON dump of platform config parameters. However, stdout shows a yum package install from repo Bigtop. I don't see a yum command in the script or stderr output (from set -xe) so I think the yum commands must be within that Java program. Not sure why they succeed here but not w/ a bootstrap action.

My private VPC does have an S3 endpoint w/ subnet routing and firewall rules allowing access to the endpoint plist. My bootstrap script is able to successfully install packages using yum (not from Bigtop repo), copy files from S3, and download code from an external git repo on the internet.

like image 993
Dig_Doug Avatar asked Oct 18 '22 11:10

Dig_Doug


2 Answers

I ran into this problem myself and here is what I have figured out. The bootstrap action 2 refers to the node provisioning, which fails due to a misstep done in custom bootstrap action 1. For example in my case, I was installing a MySQL server with version X and then during node provisioning, the execution fails since yum was trying to install a different version of MySQL server. It seems conflicting yum updates are one major source of the problem.

The messaging is very unclear and there is no good documentation either.

If you run into a similar situation, go through all the logs of your cluster. In my case, I found the problematic message in the following logfile

s3://aws-logs-<account-id>-us-west-2/elasticmapreduce/<cluster-id>/node/<instance-id>/provision-node/apps-phase/0/93d592d9-d7e8-47a8-9972-05922c5a9885/stderr.gz
like image 57
hobgoblin Avatar answered Oct 20 '22 10:10

hobgoblin


My bootstrap script was running a yum update. When I comment that out, I am able to get past the provision-node script and eventually get the cluster to a waiting state. One of the updates must have been creating some kind of conflict or other problem. I don't know which one. For now, I'm just going to avoid running yum update.

Here is the yum log. I'm guessing it is not one of the R or mysql packages. Perhaps java, kernel, aws, or util-linux?

Installed:
  kernel.x86_64 0:4.9.27-14.31.amzn1

Updated:
  R.x86_64 0:3.3.3-1.51.amzn1
  R-core.x86_64 0:3.3.3-1.51.amzn1
  R-core-devel.x86_64 0:3.3.3-1.51.amzn1
  R-devel.x86_64 0:3.3.3-1.51.amzn1
  R-java.x86_64 0:3.3.3-1.51.amzn1
  R-java-devel.x86_64 0:3.3.3-1.51.amzn1
  aws-amitools-ec2.noarch 0:1.5.13-0.2.amzn1
  aws-cli.noarch 0:1.11.83-1.46.amzn1
  java-1.8.0-openjdk.x86_64 1:1.8.0.131-2.b11.30.amzn1
  java-1.8.0-openjdk-devel.x86_64 1:1.8.0.131-2.b11.30.amzn1
  java-1.8.0-openjdk-headless.x86_64 1:1.8.0.131-2.b11.30.amzn1
  libRmath.x86_64 0:3.3.3-1.51.amzn1
  libRmath-devel.x86_64 0:3.3.3-1.51.amzn1
  libblkid.x86_64 0:2.23.2-33.28.amzn1
  libmount.x86_64 0:2.23.2-33.28.amzn1
  libuuid.x86_64 0:2.23.2-33.28.amzn1
  mysql-config.x86_64 0:5.5.56-1.17.amzn1
  mysql55.x86_64 0:5.5.56-1.17.amzn1
  mysql55-devel.x86_64 0:5.5.56-1.17.amzn1
  mysql55-libs.x86_64 0:5.5.56-1.17.amzn1
  ntp.x86_64 0:4.2.6p5-44.34.amzn1
  ntpdate.x86_64 0:4.2.6p5-44.34.amzn1
  python27-botocore.noarch 0:1.5.46-1.63.amzn1
  python27-jmespath.noarch 0:0.9.2-1.12.amzn1
  util-linux.x86_64 0:2.23.2-33.28.amzn1

Further insights welcome. Otherwise, moving on to actually getting my code running.

like image 39
Dig_Doug Avatar answered Oct 20 '22 09:10

Dig_Doug