 

Change Data Types for Dataframe by Schema in Scala Spark

I have a DataFrame without a schema, where every column is stored as StringType, such as:

ID | LOG_IN_DATE | USER
1  | 2017-11-01  | Johns

Now I created a schema list as [("ID","double"),("LOG_IN_DATE","date"),("USER","string")] and I would like to apply it to the above DataFrame in Spark 2.0.2 with Scala 2.11.

I already tried:

schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))

There's no error while running this, but afterwards when I call df.schema, nothing has changed.

Any idea how I could programmatically apply the schema to df? My friend told me I can use the foldLeft method, but I don't think that's a method in Spark 2.0.2, neither on df nor on rdd.

asked Dec 14 '25 by Sidi


1 Answer

If you already have the list [("ID","double"),("LOG_IN_DATE","date"),("USER","string")], you can use select, casting each column to its type from the list.

Your dataframe

val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "jons2")).toDF("ID", "LOG_IN_DATE", "USER")

Your schema

val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))

Cast all the columns to their types from the list

val newColumns = schema.map(c => col(c._1).cast(c._2))

Select all the cast columns

val newDF = df.select(newColumns:_*)
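
Note: select takes a varargs of columns, so the :_* ascription expands the Scala list into the individual Column arguments that select expects.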

Print Schema

newDF.printSchema()

root
 |-- ID: double (nullable = true)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)

Show Dataframe

newDF.show()

Output:

+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |jons2|
+---+-----------+-----+
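
As for foldLeft: it is a method on ordinary Scala collections (your schema list), not on the DataFrame or RDD, so it does exist and works here too. It also explains why your map attempt changed nothing: each withColumn returns a new DataFrame (DataFrames are immutable), and the mapped results were simply discarded. A minimal sketch using the same df and schema as above:

import org.apache.spark.sql.functions.col

// Fold over the schema list, threading the DataFrame through as the accumulator.
// withColumn replaces each existing column because the names match.
val castedDF = schema.foldLeft(df) { case (acc, (name, dataType)) =>
  acc.withColumn(name, col(name).cast(dataType))
}

castedDF.printSchema() // prints the same root schema as above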
answered Dec 15 '25 by koiralo


