Cluster hangs in 'ssh-ready' state using Spark 1.2.0 EC2 launch script

I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k <key-pair> -i <identity-file>.pem -r us-west-2 -s 3 launch test
Setting up security groups...
Searching for existing cluster test...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 3 slaves in us-west-2c, regid = r-b_______6
Launched master in us-west-2c, regid = r-0______0
Waiting for all instances in cluster to enter 'ssh-ready' state..........

Yet I can SSH into these instances without complaint:

ubuntu@machine:~$ ssh -i <identity-file>.pem root@master-ip
Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB-CCCC-DDD.eee1.ff.provider.net

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 59 security update(s) out of 257 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2014.09 is available.
[root@ip-internal ~]$

I'm trying to figure out whether this is a problem with AWS or with the Spark scripts. I never had this issue until recently.

asked Jan 17 '15 by nmurthy

4 Answers

Spark 1.3.0+

This issue is fixed in Spark 1.3.0.


Spark 1.2.0

Your problem is caused by SSH silently failing because of conflicting entries in your SSH known_hosts file.

To resolve your issue, add -o UserKnownHostsFile=/dev/null to the ssh options in your spark_ec2.py script.
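In spark_ec2.py the change amounts to adding that option wherever the ssh argument list is built. A minimal self-contained sketch (the helper names below are illustrative, not Spark's actual functions):

```python
# Sketch of the fix: build the ssh argument list with the option that stops
# ssh from consulting (or writing to) ~/.ssh/known_hosts, so stale or
# conflicting host-key entries can never block the connection.
# Function names and structure are illustrative, not Spark's exact code.
def ssh_args(identity_file):
    return [
        "-o", "StrictHostKeyChecking=no",
        "-o", "UserKnownHostsFile=/dev/null",  # the added option
        "-i", identity_file,
    ]

def ssh_command(identity_file, host):
    # Full command list suitable for subprocess invocation.
    return ["ssh"] + ssh_args(identity_file) + ["root@" + host]
```

With /dev/null as the known-hosts file, every connection looks like a first contact, so no stored fingerprint can conflict.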


Optionally, to clean up and avoid later problems connecting to your cluster over SSH, I recommend that you:

  1. Remove all the lines from ~/.ssh/known_hosts that include EC2 hosts, for example:

ec2-54-154-27-180.eu-west-1.compute.amazonaws.com,54.154.27.180 ssh-rsa (...)

  2. Configure SSH to stop checking and storing the fingerprints of the temporary IPs of your EC2 instances altogether
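Step 1 can be scripted. A hedged sketch that drops EC2 entries from known_hosts content (the substring used to recognize EC2 hosts is an assumption about how your entries look; adjust it for your region or provider):

```python
# Filter out known_hosts lines that refer to EC2 hosts.
# Matching on "compute.amazonaws.com" is an assumption about the entry
# format; hashed known_hosts entries would not match and need ssh-keygen -R.
def strip_ec2_entries(known_hosts_text):
    kept = [
        line for line in known_hosts_text.splitlines()
        if "compute.amazonaws.com" not in line
    ]
    return "\n".join(kept)
```

Read ~/.ssh/known_hosts, pass its contents through this function, and write the result back (keeping a backup copy first).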
answered Nov 06 '22 by Greg Dubicki


I had the same problem and followed all the steps mentioned in this thread (mainly adding -o UserKnownHostsFile=/dev/null to the spark_ec2.py script), but it still hung, saying:

Waiting for all instances in cluster to enter 'ssh-ready' state

Short answer:

Change the permissions on the private key file and rerun the spark-ec2 script:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% chmod 0400 /tmp/mykey.pem

Long Answer:

To troubleshoot, I modified spark_ec2.py to log the ssh command it used and tried executing that command at the prompt myself; the problem was bad permissions on the key:

[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/mykey.pem -o ConnectTimeout=3 root@52.1.208.72
Warning: Permanently added '52.1.208.72' (RSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '/tmp/mykey.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /tmp/mykey.pem
Permission denied (publickey).
answered Nov 06 '22 by spar128


I just ran into the exact same situation. I went into the Python script at def is_ssh_available() and had it dump out the return code and cmd.

except subprocess.CalledProcessError, e:
    print "CalledProcessError"
    print e.returncode
    print e.cmd

I had the key file location as ~/.pzkeys/mykey.pem. As an experiment, I changed it to the fully qualified path, i.e. /home/pete.zybrick/.pzkeys/mykey.pem, and that worked.

Right after that, I ran into another error: I tried to use --user=ec2-user (I try to avoid running as root) but got a permission error on rsync, so I removed --user=ec2-user so it would default to root, retried with --resume, and it ran to successful completion.
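Resolving the key path up front, before handing it to spark-ec2, can be sketched like this (the sample path is illustrative):

```python
import os.path

def absolute_key_path(path):
    # Expand ~ and resolve relative segments so the launch script always
    # receives a fully qualified path to the identity file.
    return os.path.abspath(os.path.expanduser(path))
```

Passing the result of this function as the -i argument sidesteps any difference in how the script's subprocesses interpret ~ or relative paths.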

answered Nov 06 '22 by Pete Zybrick


I used the absolute (not relative) path to my identity file (inspired by Peter Zybrick) and did everything Grzegorz Dubicki suggested. Thank you.

answered Nov 06 '22 by nmurthy