Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Assigned variable not passed to a map function in Spark

I'm using Spark 1.3.1 with Scala 2.10.4. I've tried a basic scenario which consists in parallelizing an array of 3 strings, and mapping them with a variable that I define in the driver.

Here is the code :

object BasicTest extends App {

  val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://xxxxx:7077")
  val sc = new SparkContext(conf)

  val test = sc.parallelize(Array("a", "b", "c"))
  val a = 5

  test.map(row => row + a).saveAsTextFile("output/basictest/")

}

This piece of code works in local mode, I do get a list :

a5
b5
c5

But on a real cluster, I get :

a0
b0
c0

I've tried with another code :

object BasicTest extends App {

  def test(sc: SparkContext): Unit = {

    val test = sc.parallelize(Array("a", "b", "c"))
    val a = 5

    test.map(row => row + a).saveAsTextFile("output/basictest/")

  }
  val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://xxxxx:7077")
  val sc = new SparkContext(conf)

  test(sc)

}

This works in both cases.

I just need to understand the reasons in each case. Thanks in advance for any advice.

like image 704
Gouffe Avatar asked Apr 15 '26 19:04

Gouffe


1 Answers

I believe that this relates to the use of App. This was "resolved" for SPARK-4170 only in that it warns against using App

if (classOf[scala.App].isAssignableFrom(mainClass)) {
   printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
}

Notes from the ticket:

this bug seems to be an issue with the way Scala has been defined and the differences between what happens at runtime versus compile time with respect to the way App leverages the delayedInit function.

like image 164
Justin Pihony Avatar answered Apr 17 '26 09:04

Justin Pihony



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!