I'm trying to launch a standalone Spark cluster using its pre-packaged EC2 scripts, but it hangs indefinitely waiting for the instances to enter the 'ssh-ready' state:
ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k <key-pair> -i <identity-file>.pem -r us-west-2 -s 3 launch test
Setting up security groups...
Searching for existing cluster test...
Spark AMI: ami-ae6e0d9e
Launching instances...
Launched 3 slaves in us-west-2c, regid = r-b_______6
Launched master in us-west-2c, regid = r-0______0
Waiting for all instances in cluster to enter 'ssh-ready' state..........
Yet I can SSH into these instances without complaint:
ubuntu@machine:~$ ssh -i <identity-file>.pem root@master-ip
Last login: Day MMM DD HH:mm:ss 20YY from c-AA-BBB-CCCC-DDD.eee1.ff.provider.net
       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
There are 59 security update(s) out of 257 total update(s) available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2014.09 is available.
[root@ip-internal ~]$
I'm trying to figure out whether this is a problem on the AWS side or with the Spark scripts. I never had this issue until recently.
This issue is fixed in Spark 1.3.0.
Your problem is caused by SSH failing silently because of conflicting entries in your SSH known_hosts file.
To resolve the issue, add -o UserKnownHostsFile=/dev/null to the ssh options in your spark_ec2.py script.
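Roughly like this, assuming your copy builds the ssh option list in a small helper as the 1.2.x script does (names here are from memory, not a verbatim copy of spark_ec2.py):

def ssh_args(opts):
    # The UserKnownHostsFile=/dev/null pair is the addition: ssh neither reads
    # nor records host keys, so stale EC2 entries can no longer trip it up.
    parts = ['-o', 'StrictHostKeyChecking=no',
             '-o', 'UserKnownHostsFile=/dev/null']
    if opts.identity_file is not None:
        parts += ['-i', opts.identity_file]
    return parts

Every place the script shells out to ssh then picks up the option, which is what the fix shipped in Spark 1.3.0 amounts to.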
Optionally, to clean up and avoid problems connecting to your cluster over SSH later on, I recommend removing the lines in your ~/.ssh/known_hosts file that refer to EC2 hosts, for example:
ec2-54-154-27-180.eu-west-1.compute.amazonaws.com,54.154.27.180 ssh-rsa (...)
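If you would rather not edit the file by hand, a throwaway helper of my own (not part of spark_ec2.py) can prune those entries; it backs the file up first and drops any line that mentions an EC2 public hostname:

import os
import re
import shutil

path = os.path.expanduser('~/.ssh/known_hosts')
shutil.copy(path, path + '.bak')  # keep a backup before rewriting the file

with open(path) as f:
    lines = f.readlines()

# keep every line that does not reference an EC2 public hostname
with open(path, 'w') as f:
    f.writelines(l for l in lines if not re.search(r'ec2-.*\.amazonaws\.com', l))

ssh-keygen -R <hostname> achieves the same thing one host at a time.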
I had the same problem and followed all the steps mentioned in the thread (mainly adding -o UserKnownHostsFile=/dev/null to the spark_ec2.py script), but it still hung saying:
Waiting for all instances in cluster to enter 'ssh-ready' state
Change the permissions on the private key file and rerun the spark-ec2 script:
[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% chmod 0400 /tmp/mykey.pem
To troubleshoot, I had modified spark_ec2.py to log the ssh command it uses (see the sketch after the output below) and tried executing that command at a prompt myself; the culprit was bad permissions on the key:
[spar@673d356d]/tmp/spark-1.2.1-bin-hadoop2.4/ec2% ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /tmp/mykey.pem -o ConnectTimeout=3 root@52.1.208.72
Warning: Permanently added '52.1.208.72' (RSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for '/tmp/mykey.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: /tmp/mykey.pem
Permission denied (publickey).
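For reference, the logging change was roughly the following (a sketch from memory, not the verbatim 1.2.1 function; in the real script ssh_command is a module-level helper, it is passed in here only so the sketch stands on its own):

import subprocess

def is_ssh_available(host, opts, ssh_command):
    # build the same probe command the script runs against each instance
    cmd = ssh_command(opts) + ['-o', 'ConnectTimeout=3',
                               '%s@%s' % (opts.user, host), 'true']
    print("probing with: " + ' '.join(cmd))   # added: log the exact ssh invocation
    try:
        return subprocess.check_call(cmd) == 0
    except subprocess.CalledProcessError as e:
        print("ssh probe failed, exit code %d" % e.returncode)   # added
        return False

Copying the printed command into a shell is what surfaced the key-permission error above.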
I just ran into the exact same situation. I went into the Python script, at def is_ssh_available(), and had it dump out the return code and cmd:
except subprocess.CalledProcessError, e:
    # added for debugging: show how the ssh probe failed and what it actually ran
    print "CalledProcessError"
    print e.returncode   # non-zero exit status from the ssh probe
    print e.cmd          # the exact command list that was run
I had specified the key file location as ~/.pzkeys/mykey.pem. As an experiment, I changed it to the fully qualified path, i.e. /home/pete.zybrick/.pzkeys/mykey.pem, and that worked OK.
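My guess at why the tilde form failed: if the ~ ever reaches the script unexpanded (for example because the argument was quoted), the subprocess calls pass it to ssh literally, since they do not go through a shell. A quick illustration of what the fully qualified form gives you up front (the .pzkeys path is just my own layout):

import os.path

raw = '~/.pzkeys/mykey.pem'
print('%s -> %s' % (raw, os.path.expanduser(raw)))
# on my machine: ~/.pzkeys/mykey.pem -> /home/pete.zybrick/.pzkeys/mykey.pem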
Right after the path fix, I ran into another error: I had tried to use --user=ec2-user (I try to avoid using root) and got a permission error on rsync. I removed --user=ec2-user so it would default to root, made another attempt with --resume, and it ran to successful completion.
I used the absolute (not relative) path to my identity file (inspired by Peter Zybrick) and did everything Grzegorz Dubicki suggested. Thank you.