I tried to debug a very simple Spark Scala word count program. Since Spark is "lazy", I figured I needed to put a breakpoint at an "action" statement and run that line; then I would be able to inspect the RDD variables defined before that statement and look at their data. So I put a breakpoint at line 14 (the collect() call, marked in the code below). When the debugger gets there, I hit Step Over to run line 14. However, after doing that I cannot see/find any data for the variables text1 and text2 in the debug session's variable view (though I can see data inside the all variable). Am I doing this right? Why can't I see any data in the text1/text2 variables?
Suppose my wordCount.txt is like this:
This is a text file with words aa aa bb cc cc
I expect to see (aa,2), (bb,1), (cc,2) etc. somewhere in the text2 variable view, but I don't find anything like that there. See the screenshot below the code.
I am using Eclipse Neon and Spark 2.1, and it is an Eclipse local debug session. Your help would be really appreciated, as I could not find any information after extensive searching. Here's my code:
package Big_Data.Spark_App

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCountApp").setMaster("local")
    val sc = new SparkContext(conf)

    val text = sc.textFile("/home/cloudera/Downloads/wordCount.txt")
    val text1 = text.flatMap(rec => rec.split(" ")).map(rec => (rec, 1))
    val text2 = text1.reduceByKey((v1, v2) => v1 + v2).cache
    val all = text2.collect() // line 14: breakpoint here
    all.foreach(println)
  }
}
[Screenshot: the debug variable view, showing no actual data in the text2 variable.]
Spark evaluates lazily. What I do is: if I want to print to the console, I use:
rdd.take(20).foreach(x => println(x))
Or better, rdd.sample, rdd.takeSample, or sampleByKey (the last one on pair RDDs). These give a broader picture with large data sets.
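For example, a rough sketch of those calls using text1 (the pair RDD) from the question; the fractions, sample size, and seeds here are arbitrary:

// Take roughly a 10% sample without replacement; a fixed seed makes it reproducible.
text1.sample(withReplacement = false, fraction = 0.1, seed = 42L).foreach(println)

// takeSample is an action: it returns a fixed-size local Array on the driver.
text1.takeSample(withReplacement = false, num = 20, seed = 42L).foreach(println)

// sampleByKey works on pair RDDs and needs a sampling fraction for every key,
// so build the fractions map from the distinct keys.
val fractions = text1.keys.distinct().collect().map(k => (k, 0.5)).toMap
text1.sampleByKey(withReplacement = false, fractions = fractions)
  .collect()
  .foreach(println)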
Then there is rdd.toDebugString, which you can print out; it shows the RDD's lineage rather than its data.
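For instance, with text2 from the question:

// Prints the chain of transformations (the lineage) behind text2, not its data.
println(text2.toDebugString)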
Finally, you can put a breakpoint and observe the RDD in the Eclipse/IntelliJ debugger, but only after evaluation; otherwise you will just see the execution plan, not the values.
Spark does not evaluate each variable as you expect. It builds a DAG that only gets executed once an action (e.g. collect) is triggered; this post explains it in more detail: How DAG works under the covers in RDD? Essentially, those intermediate variables only store a reference to the chain of operations you created. If you'd like to inspect intermediate results, you need to call collect on each variable.
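For example, a minimal sketch using the question's own variables:

// Each collect() is an action: it runs the DAG up to that point and returns
// a plain local Array that the debugger can display.
val words  = text1.collect() // Array[(String, Int)] of (word, 1) pairs
val counts = text2.collect() // Array[(String, Int)] of (word, count) pairs
words.foreach(println)
counts.foreach(println)

This is fine for a small file like the one in the question; on large data, prefer take or the sampling methods from the other answer.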
EDIT:
Forgot to mention above that you also have the option to inspect variables inside a Spark operation. Say you break the mapper down like this:
val conf = new SparkConf().setAppName("WordCountApp").setMaster("local")
val sc = new SparkContext(conf)

val text = sc.textFile("wordcount.txt")
val text1 = text.flatMap { rec =>
  val splitStr = rec.split(" ") // can inspect this variable
  splitStr.map(r => (r, 1))     // can inspect variable r
}
val text2 = text1.reduceByKey((v1, v2) => v1 + v2).cache
val all = text2.collect()
all.foreach(println)
You can put a breakpoint in the mapper, for example to inspect splitStr for each line of text, or on the next line to inspect r for each word.
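Note that, because of the lazy evaluation described above, a breakpoint inside the flatMap body is only hit once an action (here, the collect() at the end) actually runs.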