
Scala & Spark: Cast multiple columns at once

VectorAssembler crashes if a passed column has any type other than NumericType or BooleanType. Since I'm dealing with a lot of TimestampType columns, I want to know:

Is there an easy way to cast multiple columns at once?

Based on this answer I already have a convenient way to cast a single column:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

def castColumnTo(df: DataFrame,
    columnName: String,
    targetType: DataType): DataFrame = {
  df.withColumn(columnName, df(columnName).cast(targetType))
}

I thought about calling castColumnTo recursively, but I strongly doubt that that's the (performant) way to go.

Boern asked Feb 02 '17 08:02


2 Answers

Casting all columns of a given type, using an idiomatic Scala approach:

def castAllTypedColumnsTo(df: DataFrame, sourceType: DataType, targetType: DataType): DataFrame = {
  // Fold over the matching fields, replacing each one with its cast version.
  df.schema.filter(_.dataType == sourceType).foldLeft(df) {
    case (acc, field) => acc.withColumn(field.name, acc(field.name).cast(targetType))
  }
}
rogue-one answered Nov 17 '22 19:11


Based on the comments (thanks!) I came up with the following code (no error handling implemented):

def castAllTypedColumnsTo(df: DataFrame, 
   sourceType: DataType, targetType: DataType) : DataFrame = {

      val columnsToBeCasted = df.schema
         .filter(s => s.dataType == sourceType)

      //if(columnsToBeCasted.length > 0) {
      //   println(s"Found ${columnsToBeCasted.length} columns " +
      //      s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
      //      s" - casting to ${targetType.typeName.capitalize}Type")
      //}

      // Cast to the requested targetType (previously hard-coded to LongType).
      columnsToBeCasted.foldLeft(df) { (foldedDf, col) =>
         castColumnTo(foldedDf, col.name, targetType) }
}

Thanks for the inspiring comments. foldLeft (explained here and here) avoids a for loop that reassigns a var DataFrame on each iteration.
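
If the number of columns is large, a variant worth considering builds all casts in one select instead of chaining withColumn calls. Each withColumn adds a projection to the logical plan, so many chained calls can make the plan large; whether that matters in practice depends on the column count and Spark version. The sketch below is not from the answers above: the name castAllTypedColumnsWithSelect is hypothetical, and it assumes column names without dots or backticks.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

// Hypothetical helper (not part of the original answers): casts every
// top-level column of sourceType to targetType in a single projection.
// Assumes column names contain no dots or backticks.
def castAllTypedColumnsWithSelect(df: DataFrame,
    sourceType: DataType, targetType: DataType): DataFrame = {
  val projected = df.schema.fields.map { field =>
    if (field.dataType == sourceType)
      col(field.name).cast(targetType).as(field.name) // keep the original name
    else
      col(field.name) // pass other columns through unchanged
  }
  df.select(projected: _*)
}
```

It would be called the same way as the foldLeft version, e.g. castAllTypedColumnsWithSelect(df, TimestampType, LongType), and keeps the original column order.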

Boern answered Nov 17 '22 20:11