
How to create a custom Encoder in Spark 2.X Datasets?

Spark Datasets move away from Rows to Encoders for POJOs/primitives. The Catalyst engine uses an ExpressionEncoder to convert columns in a SQL expression. However, there do not appear to be any other subclasses of Encoder available to use as a template for our own implementations.
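To make the Encoder landscape concrete, here is a minimal sketch of where encoders normally come from in Spark 2.x (the spark session and the Image case class below are illustrative, not part of the original code):

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().appName("encoder-sketch").getOrCreate()

case class Image(id: String, label: String, channels: Int)

// Factory methods on Encoders cover the common cases...
val productEnc: Encoder[Image] = Encoders.product[Image] // case classes / tuples
val kryoEnc: Encoder[Image]    = Encoders.kryo[Image]    // arbitrary classes, serialized to binary

// ...and importing the session's implicits derives encoders automatically for
// primitives and Product types, so most code never builds one by hand.
import spark.implicits._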

Here is an example of code that works in Spark 1.X / DataFrames but does not compile in the new regime:

// mapping each row to an RDD tuple
df.map(row => {
    val id: String = if (!has_id) "" else row.getAs[String]("id")
    val label: String = row.getAs[String]("label")
    val channels: Int = if (!has_channels) 0 else row.getAs[Int]("channels")
    val height: Int = if (!has_height) 0 else row.getAs[Int]("height")
    val width: Int = if (!has_width) 0 else row.getAs[Int]("width")
    val data: Array[Byte] = row.getAs[Any]("data") match {
      case str: String => str.getBytes
      case arr: Array[Byte @unchecked] => arr
      case _ =>
        log.error("Unsupported value type")
        null
    }
    (id, label, channels, height, width, data)
  }).persist(StorageLevel.DISK_ONLY)

We get a compiler error of

Error:(56, 11) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported 
by importing spark.implicits._  Support for serializing other types will be added in future releases.
    df.map(row => {
          ^

So then, somewhere, there should be a means to:

  • Define/implement our custom Encoder
  • Apply it when performing a mapping on the DataFrame (which is now a Dataset of type Row)
  • Register the Encoder for use by other custom code

I am looking for code that successfully performs these steps.

asked Jun 08 '16 by WestCoastProjects


2 Answers

As far as I am aware, nothing has really changed since 1.6, and the solutions described in How to store custom objects in Dataset? are the only options available. Nevertheless, your current code should work just fine with the default encoders for product types.
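For example, with nothing more than the implicit product encoders in scope, a row-to-tuple mapping like yours compiles (a sketch assuming a SparkSession named spark and a DataFrame df with the columns used above):

import spark.implicits._ // implicit Encoders for primitives and Product types

val ds = df.map { row =>
  val id    = row.getAs[String]("id")
  val label = row.getAs[String]("label")
  val data  = row.getAs[Array[Byte]]("data")
  (id, label, data) // a Tuple3 is a Product, so its Encoder is derived implicitly
}
// ds: Dataset[(String, String, Array[Byte])]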

To get some insight into why your code worked in 1.x and may not work in 2.0.0, you'll have to check the signatures. In 1.x, DataFrame.map is a method which takes a function Row => T and transforms RDD[Row] into RDD[T].

In 2.0.0, DataFrame.map also takes a function of type Row => T, but it transforms Dataset[Row] (a.k.a. DataFrame) into Dataset[T], hence T requires an Encoder. If you want to get the "old" behavior, you should use RDD explicitly:

df.rdd.map(row => ???)

For Dataset[Row].map, see Encoder error while trying to map dataframe row to updated row.
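If you do want to supply an encoder yourself rather than relying on the implicits, one option is to pass it to map explicitly. A sketch using the Kryo-backed encoder (the ImageRecord class is hypothetical, and the resulting Dataset stores each object as a single binary column):

import org.apache.spark.sql.Encoders

class ImageRecord(val id: String, val data: Array[Byte]) extends Serializable

// map's second (implicit) parameter list is the Encoder, so it can be passed explicitly:
val ds = df.map { row =>
  new ImageRecord(row.getAs[String]("id"), row.getAs[Array[Byte]]("data"))
}(Encoders.kryo[ImageRecord])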

answered Oct 05 '22 by zero323

Did you import the implicit encoders?

import spark.implicits._

http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Encoder
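
Note that the implicits live on the SparkSession value itself, so the import refers to an instance rather than a package. A minimal sketch (the session name is assumed):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").getOrCreate()

// Importing from the *value* spark brings the implicit Encoders for
// primitives, case classes and tuples into scope for Dataset operations.
import spark.implicits._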

answered Oct 05 '22 by eyal edelman