I have the following dataframe:
corr_temp_df
[('vacationdate', 'date'),
('valueE', 'string'),
('valueD', 'string'),
('valueC', 'string'),
('valueB', 'string'),
('valueA', 'string')]
Now I would like to change the datatype of the column vacationdate to String, so that the dataframe takes on this new type and the datatype is overwritten for all of the entries. E.g. after running:
corr_temp_df.dtypes
The datatype of vacationdate should be overwritten.
I have already tried functions like cast, StringType and astype, but I was not successful. Do you know how to do that?
Let's create some dummy data:
import datetime
from pyspark.sql import Row
from pyspark.sql.functions import col
row = Row("vacationdate")
df = sc.parallelize([
    row(datetime.date(2015, 10, 7)),
    row(datetime.date(1971, 1, 1))
]).toDF()
If you use Spark >= 1.5.0 you can use the date_format function:
from pyspark.sql.functions import date_format
(df
    .select(date_format(col("vacationdate"), "dd-MM-yyyy")
    .alias("date_string"))
    .show())
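If you want corr_temp_df.dtypes to actually report the new type, as asked above, one way (a minimal sketch on the dummy df; df_string is just an illustrative name) is to overwrite the column with withColumn, which replaces an existing column of the same name. Remember that DataFrames are immutable, so you have to keep the returned DataFrame:

from pyspark.sql.functions import col, date_format

# date_format returns a string column, so the replaced column is typed string
df_string = df.withColumn(
    "vacationdate", date_format(col("vacationdate"), "dd-MM-yyyy"))

df_string.dtypes
## [('vacationdate', 'string')]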
In Spark < 1.5.0 it can be done using a Hive UDF:
df.registerTempTable("df")
sqlContext.sql(
    "SELECT date_format(vacationdate, 'dd-MM-yyyy') AS date_string FROM df")
It is of course still available in Spark >= 1.5.0.
If you don't use a HiveContext you can mimic date_format using a UDF:
from pyspark.sql.functions import udf, lit
my_date_format = udf(lambda d, fmt: d.strftime(fmt))
df.select(
    my_date_format(col("vacationdate"), lit("%d-%m-%Y")).alias("date_string")
).show()
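One caveat (my addition, not part of the original answer): as written, the lambda will raise an AttributeError on null dates, because None has no strftime method. A null-safe sketch:

# Pass nulls through instead of calling strftime on None
my_date_format = udf(
    lambda d, fmt: d.strftime(fmt) if d is not None else None)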
Please note that this uses the C standard (strftime) format, not a Java SimpleDateFormat pattern.
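Finally, if the exact string format doesn't matter and you only need the type change the question asked about, a plain cast should be enough (a sketch; it produces the default yyyy-MM-dd representation):

from pyspark.sql.functions import col

# cast("string") is equivalent to cast(StringType())
df.withColumn("vacationdate", col("vacationdate").cast("string")).dtypes
## [('vacationdate', 'string')]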