I want to cast the schema of a dataframe to change the type of some columns using Spark and Scala.
Specifically, I am trying to use the as[U] function, whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U".
In principle this is exactly what I want, but I cannot get it to work.
Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
// definition of data
val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
As expected the schema of data is:
root
 |-- a: string (nullable = true)
 |-- b: integer (nullable = false)
I would like to cast the column "b" to Double. So I try the following:
import session.implicits._
println(" --------------------------- Casting using (String Double)")
val data_TupleCast = data.as[(String, Double)]
data_TupleCast.show()
data_TupleCast.printSchema()
println(" --------------------------- Casting using ClassData_Double")
case class ClassData_Double(a: String, b: Double)
val data_ClassCast = data.as[ClassData_Double]
data_ClassCast.show()
data_ClassCast.printSchema()
As I understand the definition of as[U], the resulting Datasets should have the following schema:
root
 |-- a: string (nullable = true)
 |-- b: double (nullable = false)
But the output is
--------------------------- Casting using (String Double)
+---+---+
|  a|  b|
+---+---+
|  a|  1|
|  b|  2|
+---+---+

root
 |-- a: string (nullable = true)
 |-- b: integer (nullable = false)

--------------------------- Casting using ClassData_Double
+---+---+
|  a|  b|
+---+---+
|  a|  1|
|  b|  2|
+---+---+

root
 |-- a: string (nullable = true)
 |-- b: integer (nullable = false)
which shows that column "b" has not been cast to double.
Any hints on what I am doing wrong?
BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?". I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).
Well, since functions are chained and Spark does lazy evaluation, it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:
import spark.implicits._
import org.apache.spark.sql.types.{DoubleType, StringType}

df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
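Applied to the data from the question, a minimal sketch of the same idea (assuming the session variable from the question is the active SparkSession) would be:

import session.implicits._
import org.apache.spark.sql.types.DoubleType

// Cast column "b" in place; the rest of the schema is left untouched.
val dataCast = data.withColumn("b", $"b".cast(DoubleType))
dataCast.printSchema()   // "b" now prints as double

// With the underlying schema changed, as[...] lines up with the tuple type.
val typed = dataCast.as[(String, Double)]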
As an alternative, I'm thinking you could use map to do your cast in one go. For the two-column DataFrame from the question, that would look like:

data.as[(String, Int)].map { t => (t._1, t._2.toDouble) }
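If you want the more general "whole schema in one shot" behaviour the question asks for, one way to sketch it is a small helper that builds a single select from a map of column names to target types. The castColumns name below is hypothetical (not part of Spark's API), so treat this as a sketch rather than a canonical solution:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType}

// Hypothetical helper: cast every column listed in targets to the requested
// type in a single select; columns not listed keep their original type.
def castColumns(df: DataFrame, targets: Map[String, DataType]): DataFrame = {
  val cols = df.columns.map { c =>
    targets.get(c).map(t => col(c).cast(t).as(c)).getOrElse(col(c))
  }
  df.select(cols: _*)
}

// Usage with the question's data: column "b" becomes double in one pass.
val recast = castColumns(data, Map("b" -> DoubleType))
recast.printSchema()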