How to run a Scala script using spark-submit (similarly to a Python script)?

I am trying to run a simple Scala script with Spark, as described in the Spark Quick Start tutorial. I have no trouble running the following Python code:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "tmp.txt"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print "Lines with a: %i, lines with b: %i" % (numAs, numBs)

I execute this code using the following command:

/home/aaa/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit hello_world.py

However, when I try to do the same with Scala, I run into problems. The code I am trying to run is:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "tmp.txt" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

I try to run it as follows:

/home/aaa/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit hello_world.scala

As a result, I get the following error message:

Error: Cannot load main class from JAR file

Does anybody know what I am doing wrong?

asked Jun 03 '17 17:06 by Roman


2 Answers

Run spark-submit --help to see the available options and arguments.

$ ./bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

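As the first usage line shows, spark-submit requires <app jar | python file>.

The app jar argument is a Spark application's jar containing the main object (SimpleApp in your case). A .scala source file is neither a jar nor a Python file, so spark-submit most likely treats it as a jar and fails with "Cannot load main class from JAR file".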

You can build the app jar using sbt or Maven, as described in the Self-Contained Applications section of the official documentation:

Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python.

and later in the section:

we can create a JAR package containing the application’s code, then use the spark-submit script to run our program.
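
As a rough sketch of what that looks like for the question's SimpleApp, a minimal build.sbt along the lines of the Quick Start could be the following. The project name, version numbers and the resulting jar path are assumptions chosen to match Spark 2.1.0 built for Scala 2.11, so adjust them to your setup.

```scala
// build.sbt, placed in the project root, with the source file at
// src/main/scala/SimpleApp.scala (the standard sbt layout).
// Versions below are assumptions matching the Spark 2.1.0 / Scala 2.11
// distribution used in the question; change them for your installation.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"

// Build and submit from a shell (the jar name follows from the settings above):
//   sbt package
//   /home/aaa/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-submit \
//     --class "SimpleApp" \
//     --master "local[4]" \
//     target/scala-2.11/simple-project_2.11-1.0.jar
```

Unlike the Python version, the Scala code in the question does not set a master in SparkConf, which is why the sketch passes --master on the command line, as the Quick Start does.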


P.S. Use Spark 2.1.1.

answered Nov 03 '22 05:11 by Jacek Laskowski


To add to @JacekLaskowski's answer, here is an alternative that I sometimes use for proofs of concept or testing.

You can load script.scala from inside spark-shell with :load:

:load /path/to/script.scala

You won't need to define a SparkContext/SparkSession, as the script will use the variables already defined in the scope of the REPL (sc and spark).

You also don't need to wrap the code in a Scala object.
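
For illustration, the question's SimpleApp could be reduced to something like the script below. This is only a sketch: it assumes the file is loaded with :load inside spark-shell, where sc already exists, and that tmp.txt is readable from the directory spark-shell was started in.

```scala
// script.scala, meant to be loaded with `:load /path/to/script.scala`
// inside spark-shell. It will not compile on its own, because `sc`
// is the SparkContext that spark-shell creates for you.
val logFile = "tmp.txt" // Should be some file on your system

// Read the file as an RDD of lines and cache it, since it is scanned twice.
val logData = sc.textFile(logFile, 2).cache()

val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()

println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
```

There is no object, no main method and no SparkConf here; the statements run top to bottom as if typed into the REPL.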

PS: I consider this more of a hack; it should not be used in production.

answered Nov 03 '22 05:11 by eliasah