In PySpark, I have a DataFrame in which dates were imported as strings, and these string columns contain null values. I'm trying to convert them to date-type columns, but I keep getting errors. Here's a small example of the DataFrame:
+--------+----------+----------+
|DeviceId|   Created| EventDate|
+--------+----------+----------+
|       1|      null|2017-03-09|
|       1|      null|2017-03-09|
|       1|2017-03-09|2017-03-09|
|       1|2017-03-15|2017-03-15|
|       1|      null|2017-05-06|
|       1|2017-05-06|2017-05-06|
|       1|      null|      null|
+--------+----------+----------+
When there are no null values, I have found that the following code converts the data types correctly:
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

dt_func = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
df = df.withColumn('Created', dt_func(col('Created')))
Once I add null values, it crashes. I've tried to modify the udf to account for nulls as follows:
import pyspark.sql.functions as sf

def convertDatetime(x):
    return sf.when(x.isNull(), 'null').otherwise(datetime.strptime(x, '%Y-%m-%d'))

dt_func = udf(convertDatetime, DateType())
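I suspect the problem is that when/otherwise are Column operations, while a udf receives plain Python values (a SQL null arrives as None). If that's right, an ordinary None check inside the function might be all that's needed; here's a sketch of what I mean (parse_date is just a name I made up, and I haven't verified this against my full data):

```python
from datetime import datetime, date

def parse_date(x, form='%Y-%m-%d'):
    """Return a date for a non-empty string, None otherwise.
    Inside a Spark udf, a SQL null shows up as a plain Python None."""
    if x is None or x == '':
        return None
    return datetime.strptime(x, form).date()

# Wrapping it as before (assumes the pyspark imports above):
# dt_func = udf(parse_date, DateType())
# df = df.withColumn('Created', dt_func(col('Created')))
```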
I also tried filling the nulls with an arbitrary date-string, converting the columns to dates, and then trying to replace the arbitrary fill date with nulls as below:
def dt_conv(df, cols, form='%Y-%m-%d', temp_plug='1900-01-01'):
    df = df.na.fill(temp_plug)
    dt_func = udf(lambda x: datetime.strptime(x, form), DateType())
    for col_ in cols:
        df = df.withColumn(col_, dt_func(col(col_)))
    df = df.replace(datetime.strptime(temp_plug, form), 'null')
    return df
However, this method gives me this error:
ValueError: to_replace should be a float, int, long, string, list, tuple, or dict
Can someone help me figure this out?
Try this:
# Some sample data, with both empty strings and nulls
data = [(1,'','2017-03-09'),(1,None,'2017-03-09'),(1,'2017-03-09','2017-03-09')]
df = spark.createDataFrame(data).toDF('id','Created','EventDate')
df.show()
+---+----------+----------+
| id|   Created| EventDate|
+---+----------+----------+
|  1|          |2017-03-09|
|  1|      null|2017-03-09|
|  1|2017-03-09|2017-03-09|
+---+----------+----------+
from pyspark.sql.functions import unix_timestamp, when

df\
.withColumn('Created-formatted', when(df.Created.isNull() | (df.Created == ''), '0')\
.otherwise(unix_timestamp(df.Created, 'yyyy-MM-dd')))\
.withColumn('EventDate-formatted', when(df.EventDate.isNull() | (df.EventDate == ''), '0')\
.otherwise(unix_timestamp(df.EventDate, 'yyyy-MM-dd')))\
.drop('Created', 'EventDate')\
.show()
+---+-----------------+-------------------+
| id|Created-formatted|EventDate-formatted|
+---+-----------------+-------------------+
|  1|                0|         1489035600|
|  1|                0|         1489035600|
|  1|       1489035600|         1489035600|
+---+-----------------+-------------------+
I used unix_timestamp, which returns a BigInt, but you can cast or format those columns however you like.