Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert pyspark string to date format

I have a date pyspark dataframe with a string column in the format of MM-dd-yyyy and I am attempting to convert this into a date column.

I tried:

df.select(to_date(df.STRING_COLUMN).alias('new_date')).show()

And I get a string of nulls. Can anyone help?

like image 689
Jenks Avatar asked Jun 28 '16 15:06

Jenks


People also ask

How do I convert a string to a date?

Using strptime() , date and time in string format can be converted to datetime type. The first parameter is the string and the second is the date time format specifier. One advantage of converting to date format is one can select the month or date or time individually.

How do I change date format in PySpark?

In PySpark use date_format() function to convert the DataFrame column from Date to String format. In this tutorial, we will show you a Spark SQL example of how to convert Date to String format using date_format() function on DataFrame. date_format() – function formats Date to String format.

How do I convert a string to timestamp in PySpark?

The to_timestamp() function in Apache PySpark is popularly used to convert String to the Timestamp(i.e., Timestamp Type). The default format of the Timestamp is "MM-dd-yyyy HH:mm: ss. SSS," and if the input is not in the specified form, it returns Null.


2 Answers

Update (1/10/2018):

For Spark 2.2+ the best way to do this is probably using the to_date or to_timestamp functions, which both support the format argument. From the docs:

>>> from pyspark.sql.functions import to_timestamp
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

Original Answer (for Spark < 2.2)

It is possible (preferrable?) to do this without a udf:

from pyspark.sql.functions import unix_timestamp, from_unixtime

df = spark.createDataFrame(
    [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)], 
    ['date_str']
)

df2 = df.select(
    'date_str', 
    from_unixtime(unix_timestamp('date_str', 'MM/dd/yyy')).alias('date')
)

print(df2)
#DataFrame[date_str: string, date: timestamp]

df2.show(truncate=False)
#+----------+-------------------+
#|date_str  |date               |
#+----------+-------------------+
#|11/25/1991|1991-11-25 00:00:00|
#|11/24/1991|1991-11-24 00:00:00|
#|11/30/1991|1991-11-30 00:00:00|
#+----------+-------------------+
like image 106
santon Avatar answered Oct 21 '22 14:10

santon


from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType



# Creation of a dummy dataframe:
df1 = sqlContext.createDataFrame([("11/25/1991","11/24/1991","11/30/1991"), 
                            ("11/25/1391","11/24/1992","11/30/1992")], schema=['first', 'second', 'third'])

# Setting an user define function:
# This function converts the string cell into a date:
func =  udf (lambda x: datetime.strptime(x, '%m/%d/%Y'), DateType())

df = df1.withColumn('test', func(col('first')))

df.show()

df.printSchema()

Here is the output:

+----------+----------+----------+----------+
|     first|    second|     third|      test|
+----------+----------+----------+----------+
|11/25/1991|11/24/1991|11/30/1991|1991-01-25|
|11/25/1391|11/24/1992|11/30/1992|1391-01-17|
+----------+----------+----------+----------+

root
 |-- first: string (nullable = true)
 |-- second: string (nullable = true)
 |-- third: string (nullable = true)
 |-- test: date (nullable = true)
like image 49
Hugo Reyes Avatar answered Oct 21 '22 15:10

Hugo Reyes