Let's say I have a Spark Dataset like this:
scala> import java.sql.Date
scala> case class Event(id: Int, date: Date, name: String)
scala> val ds = Seq(Event(1, Date.valueOf("2016-08-01"), "ev1"), Event(2, Date.valueOf("2018-08-02"), "ev2")).toDS
I want to create a new Dataset with only the name and date fields. As far as I can see, I can either use ds.select() with TypedColumn, or I can use ds.select() with Column and then convert the DataFrame to a Dataset.
However, I can't get the former option working with the Date type. For example:
scala> ds.select($"name".as[String], $"date".as[Date])
<console>:31: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
ds.select($"name".as[String], $"date".as[Date])
^
The latter option works:
scala> ds.select($"name", $"date").as[(String, Date)]
res2: org.apache.spark.sql.Dataset[(String, java.sql.Date)] = [name: string, date: date]
Is there a way to select Date fields from a Dataset without going to DataFrame and back?
Been bashing my head against problems like these for the whole day. I think you can solve your problem with one line:
implicit val e: Encoder[(String, Date)] = org.apache.spark.sql.Encoders.kryo[(String,Date)]
At least that has been working for me.
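For context, a sketch of what that kryo encoder actually gives you (assuming a spark-shell session; the names below follow the question's setup). Note that a kryo encoder serializes the whole value into a single binary column rather than preserving the individual fields:

```scala
import java.sql.Date
import org.apache.spark.sql.{Encoder, Encoders}

// Bringing a kryo-based encoder into scope satisfies any operation
// that needs an implicit Encoder[(String, Date)].
implicit val e: Encoder[(String, Date)] = Encoders.kryo[(String, Date)]

// Caveat: a kryo encoder stores the whole tuple as one serialized
// blob, so its schema is the opaque [value: binary] rather than
// [name: string, date: date].
val pairs = spark.createDataset(Seq(("ev1", Date.valueOf("2016-08-01"))))
pairs.printSchema()
// root
//  |-- value: binary (nullable = true)
```

This makes the kryo route a blunt instrument: it unblocks the compile error, but the resulting Dataset no longer has queryable columns.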
In these cases, the problem is that for most Dataset operations, Spark 2 requires an Encoder that stores schema information (presumably for optimizations). The schema information takes the form of an implicit parameter (and a number of Dataset operations take this sort of implicit parameter).

In this case, the OP found the correct schema for java.sql.Date, so the following works:
implicit val e = org.apache.spark.sql.Encoders.DATE