 

Apache Spark Task not Serializable

I realize this question has been asked before, but I think my failure is due to a different reason.

            // 'counts' stands in for the JavaPairRDD produced by the reduce step
            // (the original variable name isn't shown in the question).
            List<Tuple2<String, Integer>> results = counts.collect();
            for (int i = 0; i < results.size(); i++) {
                System.out.println(results.get(i)._1);
            }


Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: tools.MAStreamProcessor$1
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
        at

I have a simple 'map/reduce' program in Spark. The lines above take the results of the reduce step and loop through each resulting element. If I comment them out, I get no errors. I stayed away from 'forEach' and the concise for-each loop, thinking that the iterators they generate under the hood might produce elements that aren't serializable. I've boiled it down to a plain for loop, so I'm wondering why I still run into this error.

Thanks, Ranjit

Ranjit Iyer asked Mar 16 '23

1 Answer

Use the -Dsun.io.serialization.extendedDebugInfo=true JVM flag to turn on serialization debug logging. It will tell you exactly what it is unable to serialize.
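As a rough sketch of how you might enable it, assuming the job is configured through SparkConf (the class name and app name below are made up; spark.executor.extraJavaOptions is a standard Spark setting for passing JVM flags to executors):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SerializationDebug {
        public static void main(String[] args) {
            // Driver side: the property must be set before the failing serialization
            // happens. Passing -Dsun.io.serialization.extendedDebugInfo=true on the
            // java/spark-submit command line is the most reliable route; setting it
            // programmatically at the very start of main() usually works too.
            System.setProperty("sun.io.serialization.extendedDebugInfo", "true");

            SparkConf conf = new SparkConf()
                    .setAppName("serialization-debug")
                    // Executor side: forward the same flag to the executor JVMs.
                    .set("spark.executor.extraJavaOptions",
                         "-Dsun.io.serialization.extendedDebugInfo=true");

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... build and run the map/reduce job as before ...
            sc.stop();
        }
    }

With extended debug info enabled, the NotSerializableException is followed by a field-by-field trace of the object graph, which points at the exact field that pulled the non-serializable object into the closure.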

The answer will have nothing to do with the lines you pasted. collect() is not the source of the problem; it's just what triggers the computation of the RDD. If you don't compute the RDD, nothing gets sent to the executors, so the accidental inclusion of something non-serializable in an earlier step causes no problems until collect() forces the job to run.
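To make that concrete, here is a minimal, hypothetical sketch of the usual pattern behind an exception naming an anonymous inner class such as tools.MAStreamProcessor$1 (only the MAStreamProcessor name comes from the stack trace; everything else is invented): a non-serializable anonymous object captured by a Spark function has to be shipped with the task, and the failure only shows up once the job actually runs.

    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class MAStreamProcessor {

        static void run(JavaSparkContext sc) {
            // This anonymous class compiles to MAStreamProcessor$1 and does NOT
            // implement java.io.Serializable.
            final Comparator<String> order = new Comparator<String>() {
                public int compare(String a, String b) { return a.compareTo(b); }
            };

            JavaRDD<String> words = sc.parallelize(Arrays.asList("b", "a"));

            // The map function itself is serializable (Spark's Function interface
            // extends Serializable), but it captures 'order', so Spark has to ship
            // the non-serializable Comparator to the executors with the task.
            JavaRDD<String> flagged = words.map(new Function<String, String>() {
                public String call(String w) {
                    return order.compare(w, "m") < 0 ? w.toLowerCase() : w.toUpperCase();
                }
            });

            // Depending on the Spark version, the serialization check fires either
            // eagerly at map() or, as in the stack trace above, only when an action
            // submits the job -- which is why removing the collect loop hides the error.
            List<String> results = flagged.collect();
        }
    }

Typical fixes are to make whatever the closure captures serializable (for example a small static nested class that implements java.io.Serializable), to construct it inside the function instead of capturing it, or to mark the field transient and recreate it on the executors.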

Daniel Darabos answered Apr 01 '23