 

Cast column containing multiple string date formats to DateTime in Spark

I have a date column in my Spark DataFrame that contains multiple string formats. I would like to cast these to DateTime.

The two formats in my column are:

  • mm/dd/yyyy; and
  • yyyy-mm-dd

My solution so far is to use a UDF to change the first date format to match the second as follows:

import re
from datetime import datetime

from pyspark.sql.functions import to_date, udf

def parseDate(dateString):
    # Rewrite mm/dd/yyyy dates as yyyy-mm-dd; pass everything else through unchanged
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}', dateString) is not None:
        return datetime.strptime(dateString, '%m/%d/%Y').strftime('%Y-%m-%d')
    else:
        return dateString

# Create Spark UDF based on above function
dateUdf = udf(parseDate)

df = df.select(to_date(dateUdf(df['trans_dt'])))

This works, but is not all that fault-tolerant. I am specifically concerned about:

  • Date formats I am yet to encounter.
  • Distinguishing between mm/dd/yyyy and dd/mm/yyyy (the regex I'm using clearly doesn't do this at the moment).

Is there a better way to do this?

asked Oct 05 '17 by W05aDePQw6h8e7


People also ask

How do I convert a String to a date in Spark?

Spark's to_date() function is used to convert a string column (StringType) to a date column (DateType). It takes a date held in a string and converts it to a date on the DataFrame.
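A minimal sketch of this (assuming an active SparkSession named spark; the column name dt is made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2018-12-01",)], ["dt"])
# Parse the string column into a DateType column
df.select(to_date("dt", "yyyy-MM-dd").alias("parsed")).show()
# +----------+
# |    parsed|
# +----------+
# |2018-12-01|
# +----------+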

How do I change the date format of a column in PySpark?

In PySpark, use the date_format() function to convert a DataFrame column from Date to String format.
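A minimal sketch (assuming the spark session and made-up dt column from the previous example):

from pyspark.sql.functions import date_format, to_date

df = spark.createDataFrame([("2018-12-01",)], ["dt"])
# Parse to a date, then render it back out as a string in a different pattern
(df.withColumn("d", to_date("dt"))
   .withColumn("d_str", date_format("d", "MM/dd/yyyy"))
   .show())
# d_str holds the string "12/01/2018"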

How do I cast a column in Spark?

Change Column Type using withColumn() and cast(): to convert the data type of a DataFrame column, use withColumn() with the original column name as the first argument, and as the second argument apply cast() with the target DataType on the column.
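A minimal sketch (again assuming a spark session and a hypothetical string column dt):

from pyspark.sql.functions import col

df = spark.createDataFrame([("2018-12-01",)], ["dt"])
# Replace the string column with a DateType column of the same name
df = df.withColumn("dt", col("dt").cast("date"))
df.printSchema()  # dt is now DateType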

How do I cast a timestamp to a string in Spark?

Solution: Using the date_format() Spark SQL date function, we can convert a Timestamp to String format. Spark supports all Java date format patterns for the conversion.
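A minimal sketch (assuming a spark session; the ts column is generated with current_timestamp() purely for illustration):

from pyspark.sql.functions import current_timestamp, date_format

df = spark.range(1).withColumn("ts", current_timestamp())
# Render the timestamp column as a formatted string
df.select(date_format("ts", "yyyy-MM-dd HH:mm:ss").alias("ts_str")).show(truncate=False)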


1 Answer

Personally I would recommend using SQL functions directly without expensive and inefficient reformatting:

from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Spark 2.2 or later syntax, for < 2.2 use unix_timestamp and cast
    return coalesce(*[to_date(col, f) for f in formats])
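For Spark versions before 2.2, where to_date() does not accept a format argument, a rough sketch of the same idea (the helper name to_date_pre22 is made up) could be:

from pyspark.sql.functions import coalesce, unix_timestamp

def to_date_pre22(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Parse with each format, cast the epoch seconds to timestamp, then to date
    return coalesce(
        *[unix_timestamp(col, f).cast("timestamp").cast("date") for f in formats]
    )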

This will choose the first format that can successfully parse the input string.

Usage:

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", to_date_("dt")).show()
+---+----------+----------+
| id|        dt|       pdt|
+---+----------+----------+
|  1|01/22/2010|2010-01-22|
|  2|2018-12-01|2018-12-01|
+---+----------+----------+

It will be faster than a udf, and adding new formats is just a matter of adjusting the formats parameter.

However, it won't help you with format ambiguities. In the general case, it might not be possible to resolve those without manual intervention and cross-referencing with external data.

The same thing can, of course, be done in Scala:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_date}

def to_date_(col: Column, 
             formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}
answered Sep 18 '22 by zero323