I realize this question has been asked before, but I think my failure is due to a different reason.
// 'results' here is the JavaPairRDD<String, Integer> produced by the reduce step
List<Tuple2<String, Integer>> output = results.collect();
for (int i = 0; i < output.size(); i++) {
    System.out.println(output.get(i)._1());
}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: tools.MAStreamProcessor$1 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at
I have a simple 'map/reduce' program in Spark. The lines above take the results of the reduce step and loop through each resulting element. If I comment them out, I get no errors. I stayed away from using 'forEach' and the concise for-each syntax, thinking that the code they generate under the hood might produce elements that aren't serializable. I've gotten it down to a plain for loop, so I wonder why I am still running into this error.
Thanks, Ranjit
Use the -Dsun.io.serialization.extendedDebugInfo=true flag to turn on serialization debug logging. It will tell you exactly what it's unable to serialize.
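For example, if you launch the application with spark-submit, the option can be passed to the driver JVM (which is where the task closure gets serialized) through --driver-java-options; treat the jar and class names here as placeholders for your own:

spark-submit --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" --class tools.MAStreamProcessor your-app.jar

The extended exception message then prints the chain of object references that led to the non-serializable instance, which usually points straight at the offending field or enclosing class.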
The answer will have nothing to do with the lines you pasted. The collect is not the source of the problem; it's just what triggers the computation of the RDD. If you don't compute the RDD, nothing gets sent to the executors, so the accidental inclusion of something unserializable in an earlier step causes no problems until you call collect.
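The stack trace names tools.MAStreamProcessor$1, i.e. an anonymous inner class defined inside MAStreamProcessor. Anonymous classes keep an implicit reference to their enclosing instance, so Spark tries to serialize the whole MAStreamProcessor object along with the function you passed to the map or reduce step. A common fix is to define those functions as static nested (or top-level) classes instead. Below is a minimal, self-contained sketch of that pattern; the class and variable names (WordCountSketch, ToPair, Sum, the sample input) are hypothetical and not taken from your code:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCountSketch {

    // Static nested classes do not capture the enclosing instance,
    // so only these small objects are serialized with each task.
    static class ToPair implements PairFunction<String, String, Integer> {
        @Override
        public Tuple2<String, Integer> call(String word) {
            return new Tuple2<>(word, 1);
        }
    }

    static class Sum implements Function2<Integer, Integer, Integer> {
        @Override
        public Integer call(Integer a, Integer b) {
            return a + b;
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a"));

        // map/reduce step: nothing runs until collect() triggers the job
        JavaPairRDD<String, Integer> counts =
                words.mapToPair(new ToPair()).reduceByKey(new Sum());

        List<Tuple2<String, Integer>> output = counts.collect();
        for (int i = 0; i < output.size(); i++) {
            System.out.println(output.get(i)._1() + " -> " + output.get(i)._2());
        }

        sc.stop();
    }
}

If the functions genuinely need state from MAStreamProcessor, either make that class implement Serializable and mark any non-serializable fields transient, or copy the values you need into local final variables before constructing the function so only those values are captured.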