Using Spark 2.x, it seems I cannot create a DataFrame from an RDD of Row objects that contain case class instances.
It worked fine on Spark 1.6.x, but on 2.x it fails with the following runtime exception:
java.lang.RuntimeException: Timestamp is not a valid external type for schema of struct<seconds:bigint,nanos:int>
preceded by a bunch of generated code from Catalyst.
Here is the snippet (simplified version of what I am doing):
package main

import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

object Test {
  case class Timestamp(seconds: Long, nanos: Int)

  val TIMESTAMP_TYPE = StructType(List(
    StructField("seconds", LongType, false),
    StructField("nanos", IntegerType, false)
  ))

  val SCHEMA = StructType(List(
    StructField("created_at", TIMESTAMP_TYPE, true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val rowRDD = spark.sparkContext.parallelize(Seq((0L, 0))).map {
      case (seconds: Long, nanos: Int) => {
        Row(Timestamp(seconds, nanos))
      }
    }
    spark.createDataFrame(rowRDD, SCHEMA).show(1)
  }
}
I am not sure whether this is a Spark bug or something I missed in the documentation. I know Spark 2.x introduced runtime Row encoding validation, so maybe this is related.
Any help is much appreciated.
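I am not sure whether it is a bug, but mixing dynamically typed Row, case classes and an explicit schema doesn't make much sense. With an explicit schema, a struct column is expected to hold a nested Row, not a case class instance, which is what the exception is complaining about. Either use Rows and a schema: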
import collection.mutable._
import collection.JavaConverters._
spark.createDataFrame(ArrayBuffer(Row(Row(0L, 0))).asJava, SCHEMA)
or case classes:
import spark.implicits._
Seq(Tuple1(Timestamp(0L, 0))).toDF("created_at")
Otherwise you're just doing the same job twice.
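Applied to the RDD from the question, the first option could look roughly like the sketch below. It reuses the schema from the question and replaces the case class with a nested Row. The object name NestedRowExample, the build helper and the local[*] master are assumptions made only so the example runs on its own.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}

object NestedRowExample {
  // Same schema as in the question.
  val TIMESTAMP_TYPE = StructType(List(
    StructField("seconds", LongType, nullable = false),
    StructField("nanos", IntegerType, nullable = false)
  ))

  val SCHEMA = StructType(List(
    StructField("created_at", TIMESTAMP_TYPE, nullable = true)
  ))

  // Hypothetical helper: builds the DataFrame from an RDD of nested Rows.
  def build(spark: SparkSession): DataFrame = {
    val rowRDD = spark.sparkContext.parallelize(Seq((0L, 0))).map {
      // The struct column is a nested Row, not a case class instance.
      case (seconds, nanos) => Row(Row(seconds, nanos))
    }
    spark.createDataFrame(rowRDD, SCHEMA)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch standalone.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-row-example")
      .getOrCreate()
    build(spark).show(1)
    spark.stop()
  }
}
```

The only change to the original logic is Row(Row(seconds, nanos)) in place of Row(Timestamp(seconds, nanos)); the schema alone describes the struct, so the case class is no longer needed on this path.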
Note:
If you want to express that fields can be nullable, use Options. For example:
case class Record(created_at: Option[Timestamp])
case class Timestamp(seconds: Long, nanos: Option[Int])
Seq(Record(Some(Timestamp(0L, Some(0))))).toDF
will generate a schema where created_at and created_at.nanos can be NULL, but created_at.seconds has to be set if created_at is not NULL.
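One way to check the resulting nullability is to print the inferred schema. The following is a minimal sketch; the object name NullableSchemaExample, the build helper and the local[*] master are assumptions for illustration, and the case classes are declared at the top level so that Spark can derive encoders for them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Top-level case classes so that encoders can be derived.
case class Timestamp(seconds: Long, nanos: Option[Int])
case class Record(created_at: Option[Timestamp])

object NullableSchemaExample {
  // Hypothetical helper: builds a DataFrame whose schema is inferred
  // from the case classes, including a row with a missing struct.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Record(Some(Timestamp(0L, Some(0)))),
      Record(None)
    ).toDF
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch standalone.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nullable-schema-example")
      .getOrCreate()
    // printSchema shows the nullable flag for each field.
    build(spark).printSchema()
    spark.stop()
  }
}
```

Fields typed as Option are inferred as nullable, while the plain Long field is not.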