Passing Arguments in Apache Spark

I am running this code on a local machine:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/Users/username/Spark/README.md" // hard-coded input path
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

I'd like to run the program on different files; it currently only runs on README.md. How do I pass the file path of another file (or any other argument, for that matter) when submitting to Spark? For example, I'd also like to change contains("a") to a different letter.

I run the program with:

$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar

Thanks!

asked Dec 10 '14 by monster

1 Answer

When you define your main as

 def main(args: Array[String]) {

you are preparing main to accept everything that comes after the jar on the spark-submit line as arguments. They arrive as an array named args, and you access each one with args(n) (note that Scala indexes arrays with parentheses, not square brackets).

It is good practice to check your arguments for type and format, especially if anyone other than you might run the program.
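
For instance, a minimal length check (the usage message here is just an illustration):

if (args.length < 1) {
  // Fail fast with a hint instead of an ArrayIndexOutOfBoundsException later
  System.err.println("Usage: SimpleApp <logFile>")
  System.exit(1)
}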

So instead of hard-coding the path with

val logFile = "String here"

set it to

val logFile = args(0)

and then pass the file path as the first argument when you submit. See the spark-submit docs for more on that, but essentially you just append your arguments after the jar name, as shown below.
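
Putting it together, here is a sketch of the whole flow. The argument order, the variable names first and second, and the path OtherFile.md are illustrative choices, not anything Spark requires:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // args(0) = input file, args(1) and args(2) = strings to count
    val logFile = args(0)
    val first = args(1)
    val second = args(2)
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numFirst = logData.filter(line => line.contains(first)).count()
    val numSecond = logData.filter(line => line.contains(second)).count()
    println("Lines with %s: %s, Lines with %s: %s".format(first, numFirst, second, numSecond))
  }
}

and the submit command becomes:

$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar \
  /Users/username/Spark/OtherFile.md a b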

answered Oct 23 '22 by suiterdev