 

Change Data Types for Dataframe by Schema in Scala Spark

I have a DataFrame without a schema, where every column is stored as StringType, such as:

ID | LOG_IN_DATE | USER
1  | 2017-11-01  | Johns

Now I created a schema list as [("ID","double"),("LOG_IN_DATE","date"),("USER","string")] and I would like to apply it to the above DataFrame in Spark 2.0.2 with Scala 2.11.

I already tried:

schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))

There's no error while running this, but afterwards when I call df.schema, nothing has changed.

Any idea how I could programmatically apply the schema to df? My friend told me I can use the foldLeft method, but I don't think that's a method in Spark 2.0.2, neither on df nor on rdd.

asked Dec 14 '25 by Sidi


1 Answer

If you already have the list [("ID","double"),("LOG_IN_DATE","date"),("USER","string")], you can use select, casting each column to its type from the list.

Your dataframe

val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "jons2")).toDF("ID", "LOG_IN_DATE", "USER")

Your schema

val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))

Cast all the columns to their types from the list

val newColumns = schema.map(c => col(c._1).cast(c._2))

Select all the cast columns

val newDF = df.select(newColumns:_*)
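
Note: select takes a varargs of columns, so the :_* ascription expands the Scala list into the individual Column arguments that select expects.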

Print Schema

newDF.printSchema()

root
 |-- ID: double (nullable = true)
 |-- LOG_IN_DATE: date (nullable = true)
 |-- USER: string (nullable = true)

Show Dataframe

newDF.show()

Output:

+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |jons2|
+---+-----------+-----+
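
As for foldLeft: it is a method on ordinary Scala collections (your schema list), not on the DataFrame or RDD, so it does exist and works here too. It also explains why your map attempt changed nothing: each withColumn returns a new DataFrame (DataFrames are immutable), and the mapped results were simply discarded. A minimal sketch using the same df and schema as above:

import org.apache.spark.sql.functions.col

// Fold over the schema list, threading the DataFrame through as the accumulator.
// withColumn replaces each existing column because the names match.
val castedDF = schema.foldLeft(df) { case (acc, (name, dataType)) =>
  acc.withColumn(name, col(name).cast(dataType))
}

castedDF.printSchema() // prints the same root schema as above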
answered Dec 15 '25 by koiralo


