I have a PySpark DataFrame with a string column in the format MM-dd-yyyy,
and I am attempting to convert it into a date column.
I tried:
df.select(to_date(df.STRING_COLUMN).alias('new_date')).show()
And I get a column of nulls. Can anyone help?
Using strptime(), a date and time in string format can be converted to a datetime object. The first argument is the string and the second is the date-time format specifier. One advantage of converting to a date type is that the month, day, or time components can then be selected individually.
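For example, a minimal plain-Python sketch of strptime() (the sample string is illustrative, using the slash-separated format from the answers below):
from datetime import datetime
# '%m/%d/%Y' matches strings like "11/25/1991".
dt = datetime.strptime("11/25/1991", "%m/%d/%Y")
print(dt.year, dt.month, dt.day)  # 1991 11 25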
Going the other direction, PySpark's date_format() function converts a DataFrame column from Date to String format.
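A hedged sketch of date_format(), assuming an active SparkSession named spark (the column name d is illustrative):
from pyspark.sql.functions import date_format, to_date
df = spark.createDataFrame([('1997-02-28',)], ['d'])
df = df.select(to_date(df.d).alias('d'))  # make 'd' a true date column first
df.select(date_format('d', 'MM/dd/yyyy').alias('date_str')).show()  # 02/28/1997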
The to_timestamp() function in Apache PySpark is commonly used to convert a String to a Timestamp (TimestampType). If the format argument is omitted it falls back to the cast rules, which expect roughly "yyyy-MM-dd HH:mm:ss", and if the input is not in the expected form, it returns null.
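That null-on-mismatch behavior is exactly what the question is hitting. A small sketch, again assuming a SparkSession named spark:
from pyspark.sql.functions import to_timestamp
df = spark.createDataFrame([('02-28-1997 10:30:00',)], ['t'])
df.select(to_timestamp(df.t).alias('dt')).show()  # null: input does not match the default pattern
df.select(to_timestamp(df.t, 'MM-dd-yyyy HH:mm:ss').alias('dt')).show()  # 1997-02-28 10:30:00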
Update (1/10/2018):
For Spark 2.2+ the best way to do this is probably using the to_date or to_timestamp functions, which both support the format argument. From the docs:
>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]
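Applied to the question's MM-dd-yyyy strings, a minimal sketch (the sample row is illustrative) would be:
>>> from pyspark.sql.functions import to_date
>>> df = spark.createDataFrame([('02-28-1997',)], ['STRING_COLUMN'])
>>> df.select(to_date(df.STRING_COLUMN, 'MM-dd-yyyy').alias('new_date')).collect()
[Row(new_date=datetime.date(1997, 2, 28))]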
Original Answer (for Spark < 2.2)
It is possible (preferable?) to do this without a udf:
from pyspark.sql.functions import unix_timestamp, from_unixtime

df = spark.createDataFrame(
    [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)],
    ['date_str']
)

# Parse the string with an explicit pattern, then convert the
# resulting epoch seconds back into a timestamp.
df2 = df.select(
    'date_str',
    from_unixtime(unix_timestamp('date_str', 'MM/dd/yyyy')).alias('date')
)
print(df2)
#DataFrame[date_str: string, date: timestamp]
df2.show(truncate=False)
#+----------+-------------------+
#|date_str |date |
#+----------+-------------------+
#|11/25/1991|1991-11-25 00:00:00|
#|11/24/1991|1991-11-24 00:00:00|
#|11/30/1991|1991-11-30 00:00:00|
#+----------+-------------------+
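Note that from_unixtime() returns a timestamp, not a date. If a true DateType column is needed, one more hedged step (this cast is an addition, not part of the original answer):
df3 = df2.withColumn('date', df2['date'].cast('date'))  # timestamp -> date
df3.printSchema()
# root
#  |-- date_str: string (nullable = true)
#  |-- date: date (nullable = true)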
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

# Create a dummy DataFrame:
df1 = spark.createDataFrame(
    [("11/25/1991", "11/24/1991", "11/30/1991"),
     ("11/25/1391", "11/24/1992", "11/30/1992")],
    schema=['first', 'second', 'third']
)

# Define a user-defined function that converts a string cell into a date:
func = udf(lambda x: datetime.strptime(x, '%m/%d/%Y'), DateType())
df = df1.withColumn('test', func(col('first')))
df.show()
df.printSchema()
Here is the output:
+----------+----------+----------+----------+
|     first|    second|     third|      test|
+----------+----------+----------+----------+
|11/25/1991|11/24/1991|11/30/1991|1991-11-25|
|11/25/1391|11/24/1992|11/30/1992|1391-11-17|
+----------+----------+----------+----------+
(The 1391 row shifts by eight days because Spark's legacy hybrid Julian calendar disagrees with Python's proleptic Gregorian calendar for dates that old.)
root
|-- first: string (nullable = true)
|-- second: string (nullable = true)
|-- third: string (nullable = true)
|-- test: date (nullable = true)