 

Cast the schema of a DataFrame in Spark and Scala

I want to cast the schema of a DataFrame to change the type of some columns, using Spark and Scala.

Specifically, I am trying to use the as[U] function, whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U"

In principle this is exactly what I want, but I cannot get it to work.

Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala



    // definition of data
    val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

As expected the schema of data is:


    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)
    

I would like to cast the column "b" to Double. So I try the following:



    import session.implicits._

    println(" --------------------------- Casting using (String Double)")

    val data_TupleCast = data.as[(String, Double)]
    data_TupleCast.show()
    data_TupleCast.printSchema()

    println(" --------------------------- Casting using ClassData_Double")

    case class ClassData_Double(a: String, b: Double)

    val data_ClassCast = data.as[ClassData_Double]
    data_ClassCast.show()
    data_ClassCast.printSchema()

As I understand the definition of as[U], the new Datasets should have the following schema:


    root
     |-- a: string (nullable = true)
     |-- b: double (nullable = false)

But the output is


     --------------------------- Casting using (String Double)
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

     --------------------------- Casting using ClassData_Double
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

which shows that column "b" has not been cast to double.

Any hints on what I am doing wrong?

BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?". I know I can change the type of the columns one at a time, but I am looking for a more general solution that changes the schema of the whole DataFrame in one shot (and I am trying to understand Spark in the process).

asked Oct 25 '16 by Massimo Paolucci


1 Answer

Well, since functions are chained and Spark does lazy evaluation, it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:

    import spark.implicits._
    import org.apache.spark.sql.types.{DoubleType, StringType}

    df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
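
If you would rather express the whole schema change in one shot, here is a minimal sketch of one way to do it (my own illustration, not part of the original answer; the target map and the casted name are hypothetical): build a single select that casts each column to its desired type.

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DataType, DoubleType, StringType}

    // Hypothetical target schema: column name -> desired type
    val target: Map[String, DataType] = Map("a" -> StringType, "b" -> DoubleType)

    // Cast every column in one select; columns not in the map keep their type
    val casted = data.select(data.columns.map { c =>
      target.get(c).map(t => col(c).cast(t)).getOrElse(col(c))
    }: _*)

    casted.printSchema()
    // root
    //  |-- a: string (nullable = true)
    //  |-- b: double (nullable = false)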

As an alternative, I'm thinking you could use map to do your cast in one go, like:

    // assuming the rows are (String, Int) as in the question; additional
    // columns would follow the same pattern
    df.as[(String, Int)].map { t => (t._1, t._2.toDouble) }
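
Applied to the question's data, this produces a Dataset whose second column really is a double. Note that map gives the columns fresh names (_1, _2), so you may want a toDF to restore them:

    val mapped = data.as[(String, Int)].map { t => (t._1, t._2.toDouble) }.toDF("a", "b")
    mapped.printSchema()
    // expected:
    // root
    //  |-- a: string (nullable = true)
    //  |-- b: double (nullable = false)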
answered Sep 22 '22 by Glennie Helles Sindholt