 

Change hadoop version using spark-ec2

I want to know is it possible to change the hadoop version when the cluster is created by spark-ec2?

I tried

spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 launch my-spark-cluster

then I login with

spark-ec2 -k spark -i ~/.ssh/spark.pem login my-spark-cluster

and found out the hadoop version is 1.0.4.

I want to use 2.x version of hadoop, what's the best way to do configure this?

asked Feb 10 '15 by user3684014

1 Answer

Hadoop 2.0

The spark-ec2 script doesn't support modifying an existing cluster, but you can create a new Spark cluster with Hadoop 2.

See this excerpt from the script's --help:

  --hadoop-major-version=HADOOP_MAJOR_VERSION
                    Major version of Hadoop (default: 1)

So for example:

spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 --hadoop-major-version=2 launch my-spark-cluster

...will create a cluster running the current version of Spark with Hadoop 2.
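To confirm which Hadoop version the cluster ended up with, you can log in and run `hadoop version` on the master, then read the major version out of the first line. The sketch below simulates that output locally (the exact version string is an assumption based on the CDH 4.2.0 distribution mentioned below), since checking it for real requires a running cluster:

```shell
# On the master you would run something like:
#   spark-ec2 -k spark -i ~/.ssh/spark.pem login my-spark-cluster
#   ephemeral-hdfs/bin/hadoop version
# Simulated first line of that output (assumed, CDH 4.2.0-style):
output="Hadoop 2.0.0-mr1-cdh4.2.0"

# Extract the major version number from the "Hadoop X.Y.Z..." line.
major=$(echo "$output" | sed -n 's/^Hadoop \([0-9]*\)\..*/\1/p')
echo "$major"   # prints 2
```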


If you use Spark 1.3.1 or 1.4.0 and create a standalone cluster this way, you will get Hadoop 2.0.0 MR1 (from the Cloudera Hadoop Platform 4.2.0 distribution).


The caveats are:

  • some features are not yet supported with this Hadoop version because of bugs; for example, there is a problem with using Tachyon,
  • although in theory Spark 1.4.0 lets you create a YARN cluster using spark-ec2, it's not documented yet as of June 2015, and our attempts to use it have failed,

...but I have successfully used a few clusters of Spark 1.2.0 and 1.3.1 created with Hadoop 2.0.0, including some Hadoop 2-specific features (for Spark 1.2.0 with a few tweaks that I have put in my forks of Spark and spark-ec2, but that's another story).


Hadoop 2.4, 2.6

If you need Hadoop 2.4 or Hadoop 2.6, then I would currently (as of June 2015) recommend creating a standalone cluster manually - it's easier than you probably think.
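The manual route boils down to downloading a Spark release prebuilt against the Hadoop version you want on each node, then starting the standalone master and workers with Spark's bundled `sbin` scripts. This is a sketch, assuming the standard Apache archive layout for prebuilt binaries; the hostnames and node setup are up to you:

```shell
# Build the download URL for a Spark release prebuilt against Hadoop 2.4
# (versions chosen here are examples; adjust to what you need).
SPARK_VERSION=1.4.0
HADOOP_PROFILE=hadoop2.4
TARBALL="spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}.tgz"
URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${TARBALL}"
echo "$URL"

# Then, on each node:
#   wget "$URL" && tar xzf "$TARBALL" && cd "spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}"
#   ./sbin/start-master.sh                              # on the master
#   ./sbin/start-slave.sh spark://<master-host>:7077    # on each worker
```

The workers connect back to the master URL printed in the master's log, and the cluster runs whatever Hadoop client libraries the chosen tarball was built against.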

answered Oct 12 '22 by Greg Dubicki