I have a column 'start_date' which is an integer 37823. This happened when I used xlrd library to convert xlsx to csv. Hence '2003/07/21' got converted to 37823.
I have gone through xlrd documentation and I understand there are several ways to convert it to date. However, I need to convert this to date format using PySpark in AWS Glue ETL jobs. Any suggestions?
I tried using to_date, date_format functions, but nothing worked.
I was able to finally resolve this issue.
Using UDF:
def convert_date(x):
mDt = datetime.datetime(1899, 12, 30)
dlt = mDt + datetime.timedelta(days=x)
return dlt.strftime("%Y-%m-%d")
convert_date_udf = udf(lambda z: convert_date(z), StringType())
df = df.withColumn('hire date', convert_date_udf('hire date').alias('hire date new'))
Without using UDF:
df = df.withColumn('hire date', F.expr("date_add(to_date('1899-12-30'), cast(`hire date` as int))").cast(StringType())
Hope it helps!
I suppose that there are more elegant ways to done this but this is what I have come up with for now.
from datetime import date
df.hire_date = df.hire_date.apply(date.fromordinal) # this will give you date in dash format
df.hire_date = df.hire_date.apply(lambda x: str(x).replace('-', '/')) # this will simply replace dash with slash
Hope that this works for you :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With