I get org.apache.spark.SparkException: Task not serializable
when I try to execute the following on Spark 1.4.1:
    import java.sql.{Date, Timestamp}
    import java.text.SimpleDateFormat

    object ConversionUtils {
      val iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")

      def tsUTC(s: String): Timestamp = new Timestamp(iso8601.parse(s).getTime)

      val castTS = udf[Timestamp, String](tsUTC _)
    }

    val df = frame.withColumn("ts", ConversionUtils.castTS(frame("ts_str")))
    df.first
Here, frame is a DataFrame that lives within a HiveContext. That data frame itself does not have any issues.
I have similar UDFs for integers and they work without any problem. However, the one with timestamps seems to cause problems. According to the documentation, java.sql.Timestamp implements Serializable, so that's not the problem. The same is true for SimpleDateFormat, as can be seen here.
This leads me to believe it's the UDF that's causing the problem, but I'm not sure what exactly is wrong or how to fix it.
The relevant section of the trace:
    Caused by: java.io.NotSerializableException: ...
    Serialization stack:
        - object not serializable (class: ..., value: ...$ConversionUtils$@63ed11dd)
        - field (class: ...$ConversionUtils$$anonfun$3, name: $outer, type: class ...$ConversionUtils$)
        - object (class ...$ConversionUtils$$anonfun$3, <function1>)
        - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2, name: func$2, type: interface scala.Function1)
        - object (class org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2, <function1>)
        - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUdf, name: f, type: interface scala.Function1)
        - object (class org.apache.spark.sql.catalyst.expressions.ScalaUdf, scalaUDF(ts_str#2683))
        - field (class: org.apache.spark.sql.catalyst.expressions.Alias, name: child, type: class org.apache.spark.sql.catalyst.expressions.Expression)
        - object (class org.apache.spark.sql.catalyst.expressions.Alias, scalaUDF(ts_str#2683) AS ts#7146)
        - element of array (index: 35)
        - array (class [Ljava.lang.Object;, size 36)
        - field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
        - object (class scala.collection.mutable.ArrayBuffer,
On a side note, preferring built-in Spark SQL functions also cuts down testing effort, since everything is executed on Spark's side. These functions are written and tuned by the Spark developers, so a hand-written UDF is unlikely to perform better.
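For the conversion in the question, a built-in cast may already be enough. This is only a sketch; whether ISO 8601 strings with the 'T' separator and millisecond/zone suffix cast cleanly on Spark 1.4.1 would need checking against the actual data:

    // Sketch: use the built-in string-to-timestamp cast instead of a UDF.
    // Assumes "ts_str" is in a format Spark's cast accepts.
    val dfBuiltin = frame.withColumn("ts", frame("ts_str").cast("timestamp"))
    dfBuiltin.first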
The reason Python UDFs are slow is probably that PySpark UDFs are not implemented in the most optimized way. As the linked article notes, Spark added a Python API in version 0.7 with support for user-defined functions; those functions operate one row at a time and have to move data between the JVM and the Python worker, which adds serialization and invocation overhead.
User-Defined Functions (UDFs) are user-programmable routines that act on one row. This documentation lists the classes that are required for creating and registering UDFs. It also contains examples that demonstrate how to define and register UDFs and invoke them in Spark SQL.
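As a minimal illustration of that define-register-invoke flow (the function name strLen and the table people are made up for the example, not taken from the question):

    // Register a Scala function as a SQL UDF and call it from a query.
    sqlContext.udf.register("strLen", (s: String) => s.length)
    sqlContext.sql("SELECT name, strLen(name) AS name_len FROM people").show()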
The serialization stack shows the UDF's anonymous function holding an $outer reference to the ConversionUtils object, which is itself not serializable, so the whole object fails when the task is shipped. Try:

    object ConversionUtils extends Serializable {
      ...
    }
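A sketch of what the corrected object could look like, staying close to the code in the question (turning iso8601 into a def that builds a new formatter per call is an extra precaution because SimpleDateFormat is not thread-safe, not something the fix itself requires):

    import java.sql.Timestamp
    import java.text.SimpleDateFormat
    import org.apache.spark.sql.functions.udf

    // The object is now serializable, so the closure's $outer reference
    // no longer breaks task serialization.
    object ConversionUtils extends Serializable {
      // New formatter per call: SimpleDateFormat instances are not thread-safe.
      private def iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")

      def tsUTC(s: String): Timestamp = new Timestamp(iso8601.parse(s).getTime)

      val castTS = udf[Timestamp, String](tsUTC _)
    }

    val df = frame.withColumn("ts", ConversionUtils.castTS(frame("ts_str")))
    df.first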