 

Where to find the data inside an RDD in an Eclipse Spark Scala debug session?


I tried to debug a very simple Spark Scala word count program. Since Spark is lazy, I assumed I needed to put the breakpoint at an action statement and run that line; after that I should be able to inspect the RDD variables defined before it and look at their data. So I put a breakpoint at line 14, and when the debugger got there, I hit Step Over to run line 14. However, after doing that I cannot find any data for the variables text1 and text2 in the debug session's Variables view (although I can see the data inside the all variable). Am I doing this right? Why can't I see data in the text1/text2 variables?

Suppose my wordCount.txt is like this:

This is a text file with words aa aa bb cc cc

I expect to see (aa,2), (bb,1), (cc,2), etc. somewhere in the Variables view for text2, but I don't find anything like that there. See the screenshot below the code.

I am using Eclipse Neon and Spark 2.1, and it is a local Eclipse debug session. Your help would be really appreciated, as I could not find any information after an extensive search. Here's my code:

package Big_Data.Spark_App 

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountApp").setMaster("local")
    val sc = new SparkContext(conf)
    val text = sc.textFile("/home/cloudera/Downloads/wordCount.txt")
    val text1 = text.flatMap(rec => rec.split(" ")).map(rec => (rec, 1))
    val text2 = text1.reduceByKey((v1, v2) => v1 + v2).cache()

    val all = text2.collect()  // line 14
    all.foreach(println)
  }
}

Here's the debug Variables view, which shows no actual data in the text2 variable:

Jerry asked May 16 '17 15:05



2 Answers

Spark evaluates lazily. When I want to print a few elements to the console, I use:

rdd.take(20).foreach(x => println(x))

Or better, use rdd.sample, rdd.takeSample, or, for pair RDDs, sampleByKey. These give a broader picture of a large data set than its first few elements do.
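As a rough sketch of the sampling approach, assuming a local SparkContext and a small in-memory range standing in for a large RDD (the object name SamplePreview, the helper randomPreview, and the data are made up for illustration): sample returns another lazy RDD, while takeSample returns a local array of a fixed size.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SamplePreview {
  // Hypothetical helper: picks `n` random elements and returns them as a local Array.
  def randomPreview(sc: SparkContext, n: Int): Array[Int] = {
    val rdd = sc.parallelize(1 to 1000) // stand-in for a large RDD
    // takeSample is an action: it runs a job and brings the result to the driver.
    rdd.takeSample(withReplacement = false, num = n, seed = 42L)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SamplePreview").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // sample is a transformation: it keeps roughly 1% of the elements and
      // stays lazy until an action such as collect is called on it.
      val sampled = sc.parallelize(1 to 1000)
        .sample(withReplacement = false, fraction = 0.01, seed = 42L)
      sampled.collect().foreach(println)
      randomPreview(sc, 5).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because takeSample hands back a plain Array, its result is also something a debugger can display directly.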

There is also rdd.toDebugString, which returns a description of the RDD's lineage that you can print.
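A minimal sketch of toDebugString on a word-count pipeline, assuming a local SparkContext and two hard-coded input lines (the object name LineagePreview and the helper lineage are hypothetical). Building the pipeline and asking for its lineage does not run a job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineagePreview {
  // Builds the pipeline and returns its lineage description; no action is called.
  def lineage(sc: SparkContext): String = {
    val counts = sc.parallelize(Seq("aa aa bb", "cc cc")) // stand-in input
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LineagePreview").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Prints the chain of RDDs behind `counts`, including the ShuffledRDD
      // that reduceByKey introduces.
      println(lineage(sc))
    } finally {
      sc.stop()
    }
  }
}
```

This shows how the RDD was derived, not the values it will contain.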

Finally, you can put a breakpoint and observe the RDD in the Eclipse/IntelliJ debugger, but only after evaluation, that is, once an action such as collect or take has brought results back to the driver. Otherwise you will just see the execution plan, not the values.

Apurva Singh answered Sep 21 '22 10:09



Spark does not evaluate each variable as you expect. It builds a DAG that gets executed only once an action (e.g. collect) is called. This post explains it in more detail: How DAG works under the covers in RDD? Essentially, those intermediate variables only store a reference to the chain of operations you created. If you'd like to inspect intermediate results, you need to call collect on each variable.
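One way to do that, sketched under the assumption of a local SparkContext and a single hard-coded input line in place of the text file (the object name InspectIntermediate, the helper materialize, and the names text1Data and text2Data are invented for this example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InspectIntermediate {
  // Hypothetical helper: returns each stage of the pipeline as a local Array.
  def materialize(sc: SparkContext): (Array[(String, Int)], Array[(String, Int)]) = {
    val text = sc.parallelize(Seq("aa aa bb cc cc")) // stand-in for sc.textFile(...)
    val text1 = text.flatMap(_.split(" ")).map(word => (word, 1))
    val text2 = text1.reduceByKey(_ + _).cache()

    // Debug-only: each collect runs a job and copies that stage to the driver.
    val text1Data = text1.collect()
    val text2Data = text2.collect()
    (text1Data, text2Data)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InspectIntermediate").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (text1Data, text2Data) = materialize(sc)
      // A breakpoint on the next line shows both arrays in the debugger.
      text2Data.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

collect copies the whole RDD to the driver, so this is only sensible for small inputs; on larger data, take(n) is a safer way to get a local array.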

EDIT:

I forgot to mention above that you also have the option to inspect variables inside a Spark operation. Say you break down the mapper like this:

val conf = new SparkConf().setAppName("WordCountApp").setMaster("local")
val sc = new SparkContext(conf)
val text = sc.textFile("wordcount.txt")
val text1 = text.flatMap { rec =>
  val splitStr = rec.split(" ") // can inspect this variable
  splitStr.map(r => (r, 1)) // can inspect variable r
}
val text2 = text1.reduceByKey((v1, v2) => v1 + v2).cache()
val all = text2.collect()
all.foreach(println)

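You can put a breakpoint inside the mapper, for example on its first line to inspect splitStr for each line of text, or on the next line to inspect r for each word. These breakpoints are hit only once an action such as collect actually runs the job.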

jamborta answered Sep 25 '22 10:09
