Is it possible to change the Hadoop version when the cluster is created by spark-ec2?
I tried
spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 launch my-spark-cluster
then I logged in with
spark-ec2 -k spark -i ~/.ssh/spark.pem login my-spark-cluster
and found that the Hadoop version is 1.0.4.
I want to use a 2.x version of Hadoop; what's the best way to configure this?
The spark-ec2 script doesn't support modifying an existing cluster, but you can create a new Spark cluster with Hadoop 2. See this excerpt from the script's --help:
--hadoop-major-version=HADOOP_MAJOR_VERSION
Major version of Hadoop (default: 1)
So for example:
spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 --hadoop-major-version=2 launch my-spark-cluster
...will create a cluster using the current version of Spark and Hadoop 2.
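To confirm which Hadoop ended up on the cluster, you can log back in and ask Hadoop directly. The path below is an assumption based on the default spark-ec2 AMI layout, where HDFS lives under /root/ephemeral-hdfs on the master; adjust it if your image differs:
spark-ec2 -k spark -i ~/.ssh/spark.pem login my-spark-cluster
# then, on the master node (path assumed from the default spark-ec2 layout):
/root/ephemeral-hdfs/bin/hadoop version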
If you use Spark 1.3.1 or 1.4.0 and create a standalone cluster this way, you will get Hadoop 2.0.0 MR1 (from the Cloudera Hadoop Platform 4.2.0 distribution).
The caveats are:
...but I have successfully used a few clusters of Spark 1.2.0 and 1.3.1 created with Hadoop 2.0.0, including some Hadoop 2-specific features. (For Spark 1.2.0 this required a few tweaks, which I have put in my forks of Spark and spark-ec2, but that's another story.)
If you need Hadoop 2.4 or Hadoop 2.6, then I would currently (as of June 2015) recommend creating a standalone cluster manually - it's easier than you probably think.
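As a rough sketch of that manual route (the download URL and host name below are examples, and the exact arguments to the sbin scripts can differ between Spark versions):
# on every node: grab a Spark build that is pre-built against Hadoop 2.x
wget http://archive.apache.org/dist/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
tar xzf spark-1.4.0-bin-hadoop2.6.tgz && cd spark-1.4.0-bin-hadoop2.6
# on the master node: start the standalone master
./sbin/start-master.sh
# on each worker node: point a worker at the master (replace the host name)
./sbin/start-slave.sh spark://master-host:7077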