I realize this question has been asked before, but I think my failure is due to a different reason.
// 'results' here is the JavaPairRDD<String, Integer> produced by the reduce step
List<Tuple2<String, Integer>> output = results.collect();
for (int i = 0; i < output.size(); i++) {
    System.out.println(output.get(i)._1());
}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: tools.MAStreamProcessor$1 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at
I have a simple 'map/reduce' program in Spark. The lines above take the results of the reduce step and loop through each resulting element. If I comment them out, I get no errors. I stayed away from using 'forEach' and the concise for-each syntax, thinking that the code they generate under the hood might produce elements that aren't serializable. I've gotten it down to a plain for loop, so I wonder why I am still running into this error.
Thanks, Ranjit
Use the -Dsun.io.serialization.extendedDebugInfo=true flag to turn on serialization debug logging. It will tell you exactly what it's unable to serialize.
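For example, if you launch the application with spark-submit, the option can be passed to the driver JVM (which is where the task closure gets serialized) through --driver-java-options; treat the jar and class names here as placeholders for your own:

spark-submit --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" --class tools.MAStreamProcessor your-app.jar

The extended exception message then prints the chain of object references that led to the non-serializable instance, which usually points straight at the offending field or enclosing class.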
The answer will have nothing to do with the lines you pasted. The collect is not the source of the problem; it's just what triggers the computation of the RDD. If you don't compute the RDD, nothing gets sent to the executors, so the accidental inclusion of something unserializable in an earlier step causes no problems until you call collect.
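The stack trace names tools.MAStreamProcessor$1, i.e. an anonymous inner class defined inside MAStreamProcessor. Anonymous classes keep an implicit reference to their enclosing instance, so Spark tries to serialize the whole MAStreamProcessor object along with the function you passed to the map or reduce step. A common fix is to define those functions as static nested (or top-level) classes instead. Below is a minimal, self-contained sketch of that pattern; the class and variable names (WordCountSketch, ToPair, Sum, the sample input) are hypothetical and not taken from your code:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCountSketch {

    // Static nested classes do not capture the enclosing instance,
    // so only these small objects are serialized with each task.
    static class ToPair implements PairFunction<String, String, Integer> {
        @Override
        public Tuple2<String, Integer> call(String word) {
            return new Tuple2<>(word, 1);
        }
    }

    static class Sum implements Function2<Integer, Integer, Integer> {
        @Override
        public Integer call(Integer a, Integer b) {
            return a + b;
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a"));

        // map/reduce step: nothing runs until collect() triggers the job
        JavaPairRDD<String, Integer> counts =
                words.mapToPair(new ToPair()).reduceByKey(new Sum());

        List<Tuple2<String, Integer>> output = counts.collect();
        for (int i = 0; i < output.size(); i++) {
            System.out.println(output.get(i)._1() + " -> " + output.get(i)._2());
        }

        sc.stop();
    }
}

If the functions genuinely need state from MAStreamProcessor, either make that class implement Serializable and mark any non-serializable fields transient, or copy the values you need into local final variables before constructing the function so only those values are captured.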