Spark Dataset and java.sql.Date

Let's say I have a Spark Dataset like this:

scala> import java.sql.Date
scala> case class Event(id: Int, date: Date, name: String)
scala> val ds = Seq(Event(1, Date.valueOf("2016-08-01"), "ev1"), Event(2, Date.valueOf("2018-08-02"), "ev2")).toDS

I want to create a new Dataset with only the name and date fields. As far as I can see, I can either use ds.select() with TypedColumn, or use ds.select() with Column and then convert the resulting DataFrame back to a Dataset.

However, I can't get the former option working with the Date type. For example:

scala> ds.select($"name".as[String], $"date".as[Date])
<console>:31: error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
              ds.select($"name".as[String], $"date".as[Date])
                                                      ^

The latter option works:

scala> ds.select($"name", $"date").as[(String, Date)]
res2: org.apache.spark.sql.Dataset[(String, java.sql.Date)] = [name: string, date: date]

Is there a way to select Date fields from Dataset without going to DataFrame and back?

asked Aug 05 '16 by Lukáš Lalinský

People also ask

Is date in Spark SQL?

Spark SQL supports almost all of the date and time functions available in Apache Hive. You can use these built-in date functions to manipulate DataFrame columns that contain date values.

How do I change the date format in Spark SQL?

Spark's to_date() function converts a string (StringType) column to a date (DateType) column. It takes a column of date strings and parses it into a proper date column on the DataFrame.
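A minimal sketch of to_date() in action, assuming a hypothetical one-column DataFrame of date strings (the two-argument to_date(col, fmt) overload requires Spark 2.2+):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date

val spark = SparkSession.builder().master("local[*]").appName("to-date-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: dates stored as plain strings
val df = Seq("2016-08-01", "2018-08-02").toDF("date_str")

// Parse the StringType column into a DateType column
val withDate = df.withColumn("date", to_date($"date_str", "yyyy-MM-dd"))
// withDate.printSchema would now show: date: date
```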

How does Spark determine date format in DataFrame?

Spark provides current_date() to get the current system date as a DateType value in `yyyy-MM-dd` format, and current_timestamp() to get the current timestamp in `yyyy-MM-dd HH:mm:ss.SSSS` format.
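A short sketch of both functions, assuming a local SparkSession; spark.range(1) is just a convenient one-row DataFrame to select against:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_date, current_timestamp}

val spark = SparkSession.builder().master("local[*]").appName("now-demo").getOrCreate()

// One-row DataFrame so the functions are evaluated exactly once
val now = spark.range(1).select(
  current_date().as("today"),      // DateType, yyyy-MM-dd
  current_timestamp().as("ts")     // TimestampType
)
// now.show(truncate = false) would print today's date and timestamp
```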

What is the difference between Spark SQL and SQL?

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.


1 Answer

Been bashing my head against problems like these for the whole day. I think you can solve your problem with one line:

implicit val e: Encoder[(String, Date)] = org.apache.spark.sql.Encoders.kryo[(String,Date)]

At least that has been working for me. (Note that a Kryo encoder serializes the value into a single binary column, so you lose the columnar schema.)

EDIT

In these cases, the problem is that most Dataset operations in Spark 2 require an Encoder that stores schema information (presumably for optimizations). The encoder is passed as an implicit parameter, which is why so many Dataset operations take one.

In this case, Spark already ships an encoder with the correct schema for java.sql.Date, so the following works:

implicit val e = org.apache.spark.sql.Encoders.DATE
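Putting the answer together with the original question, a minimal sketch (assuming a local SparkSession; the Event case class and data come from the question):

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("date-encoder-demo").getOrCreate()
import spark.implicits._

case class Event(id: Int, date: Date, name: String)
val ds = Seq(
  Event(1, Date.valueOf("2016-08-01"), "ev1"),
  Event(2, Date.valueOf("2018-08-02"), "ev2")
).toDS

// Bring an Encoder[java.sql.Date] into implicit scope so .as[Date] compiles
implicit val dateEncoder = org.apache.spark.sql.Encoders.DATE

// Now the typed select works without a round trip through DataFrame
val result = ds.select($"name".as[String], $"date".as[Date]) // Dataset[(String, Date)]
```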
answered Sep 20 '22 by Alec