I have a date column in my Spark DataFrame that contains multiple string formats. I would like to cast these to DateTime.
The two formats in my column are:
mm/dd/yyyy; and
yyyy-mm-dd
My solution so far is to use a UDF to change the first date format to match the second as follows:
import re
from datetime import datetime
from pyspark.sql.functions import to_date, udf

def parseDate(dateString):
    # Rewrite mm/dd/yyyy values as yyyy-mm-dd; leave everything else untouched
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}', dateString) is not None:
        return datetime.strptime(dateString, '%m/%d/%Y').strftime('%Y-%m-%d')
    else:
        return dateString

# Create Spark UDF based on above function
dateUdf = udf(parseDate)

df = df.select(to_date(dateUdf(df['trans_dt'])))
This works, but is not all that fault-tolerant. I am specifically concerned about distinguishing mm/dd/yyyy from dd/mm/yyyy (the regex I'm using clearly doesn't handle this at the moment). Is there a better way to do this?
Personally I would recommend using SQL functions directly without expensive and inefficient reformatting:
from pyspark.sql.functions import coalesce, to_date

def to_date_(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # Spark 2.2 or later syntax; for < 2.2 use unix_timestamp and cast
    return coalesce(*[to_date(col, f) for f in formats])
This will choose the first format that can successfully parse the input string.
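On Spark versions before 2.2, where to_date() does not accept a format argument, the same idea can be sketched with unix_timestamp() plus a cast (a rough equivalent of the comment above, not part of the original answer; to_date_pre22 is just an illustrative name):

from pyspark.sql.functions import coalesce, unix_timestamp

def to_date_pre22(col, formats=("MM/dd/yyyy", "yyyy-MM-dd")):
    # unix_timestamp(col, f) yields NULL when col doesn't match f,
    # so coalesce still picks the first format that parses
    return coalesce(*[unix_timestamp(col, f).cast("timestamp").cast("date")
                      for f in formats])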
Usage:
df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", to_date_("dt")).show()
+---+----------+----------+
| id| dt| pdt|
+---+----------+----------+
| 1|01/22/2010|2010-01-22|
| 2|2018-12-01|2018-12-01|
+---+----------+----------+
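Note also how unparsed values behave: a string that matches none of the listed formats comes back as NULL (under default parser settings), so rows your formats don't cover are easy to surface for manual review. A minimal sketch with a hypothetical "unknown" value:

from pyspark.sql.functions import col

df2 = spark.createDataFrame(
    [(1, "01/22/2010"), (2, "2018-12-01"), (3, "unknown")], ("id", "dt"))
# Strings matching neither format parse to NULL, so the leftovers
# can be pulled out for inspection
df2.withColumn("pdt", to_date_("dt")).filter(col("pdt").isNull()).show()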
It will be faster than a udf, and adding new formats is just a matter of adjusting the formats parameter, as shown below.
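For example, supporting a third, hypothetical pattern such as dd.MM.yyyy only requires extending the tuple:

df.withColumn(
    "pdt",
    to_date_("dt", formats=("MM/dd/yyyy", "yyyy-MM-dd", "dd.MM.yyyy"))
).show()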
However, it won't help you with format ambiguities. In the general case it might not be possible to resolve these without manual intervention and cross-referencing with external data.
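To see the ambiguity concretely, a value such as "01/02/2010" is valid under both MM/dd/yyyy and dd/MM/yyyy, and coalesce will silently keep whichever format is listed first. A small illustration:

from pyspark.sql.functions import to_date

amb = spark.createDataFrame([(1, "01/02/2010")], ("id", "dt"))
amb.select(
    to_date("dt", "MM/dd/yyyy").alias("as_mm_dd"),  # 2010-01-02
    to_date("dt", "dd/MM/yyyy").alias("as_dd_mm")   # 2010-02-01
).show()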
The same thing can of course be done in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, to_date}

def to_date_(col: Column,
             formats: Seq[String] = Seq("MM/dd/yyyy", "yyyy-MM-dd")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}