 

Scala dependency on Spark installation

I am just getting started with Spark, so I downloaded the binaries for Hadoop 1 (HDP1, CDH3) from here and extracted them on an Ubuntu VM. Without installing Scala, I was able to execute the examples in the Quick Start guide from the Spark interactive shell.

  1. Does Spark come included with Scala? If yes, where are the libraries/binaries?
  2. For running Spark in other modes (distributed), do I need to install Scala on all the nodes?

As a side note, I observed that Spark has some of the best documentation among open source projects.

Praveen Sripati asked Jan 24 '14 11:01


1 Answer

Does Spark come included with Scala? If yes, where are the libraries/binaries?

The project configuration is placed in the project/ folder. In my case it is:

$ ls project/
build.properties  plugins.sbt  project  SparkBuild.scala  target

When you run sbt/sbt assembly, it downloads the appropriate version of Scala along with the other project dependencies. Check out the target/ folder, for example:

$ ls target/
scala-2.9.2  streams

Note that the Scala version is 2.9.2 in my case.
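
As a side note, the Scala version that sbt fetches is pinned inside the build definition itself. The real project/SparkBuild.scala is far more elaborate, but conceptually it boils down to something like the following minimal sketch (the project name and settings are illustrative; the version number is taken from the listing above):

import sbt._
import Keys._

// Minimal sketch of an sbt build definition pinning the Scala version.
// When you run sbt/sbt assembly, sbt downloads this Scala compiler and
// library (hence the scala-2.9.2 directory under target/) together with
// the other project dependencies.
object SparkBuild extends Build {
  lazy val root = Project(
    id = "spark",
    base = file("."),
    settings = Defaults.defaultSettings ++ Seq(
      scalaVersion := "2.9.2"
    )
  )
}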

For running Spark in other modes (distributed), do I need to install Scala on all the nodes?

Yes. You can create a single assembly jar as described in the Spark documentation:

If your code depends on other projects, you will need to ensure they are also present on the slave nodes. A popular approach is to create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark itself as a provided dependency; it need not be bundled since it is already present on the slaves. Once you have an assembled jar, add it to the SparkContext as shown here. It is also possible to submit your dependent jars one-by-one when creating a SparkContext.
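
To make that concrete, the "list Spark itself as a provided dependency" part is expressed in your application's own build. Below is a hedged sketch of a build.sbt, assuming sbt plus the sbt-assembly plugin and era-appropriate versions; adjust the Spark and Scala versions to whatever your cluster actually runs:

// build.sbt for your Spark application (versions are illustrative)
name := "my-spark-app"

version := "0.1"

scalaVersion := "2.10.3"

// Spark is "provided": the cluster nodes already ship it, so it is not
// bundled into the assembly ("uber") jar produced by the sbt-assembly
// plugin (which is declared separately in project/plugins.sbt).
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided"

And for the "add it to the SparkContext" step, a sketch of shipping the assembled jar when the context is created (the master URL and jar path below are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("spark://master:7077")   // hypothetical master URL
      .setAppName("MyApp")
      // hypothetical path to the assembly jar built by sbt assembly
      .setJars(Seq("target/scala-2.10/my-spark-app-assembly-0.1.jar"))
    val sc = new SparkContext(conf)
    // ... run your jobs with sc ...
    sc.stop()
  }
}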

tuxdna answered Oct 02 '22 13:10