My DataFrame contains a field which is a date, and it appears in string format, for example:
'2015-07-02T11:22:21.050Z'
I need to filter the DataFrame on the date to get only the records in the last week. So, I was trying a map approach where I transformed the string dates to datetime objects with strptime:
from datetime import datetime, timedelta

def map_to_datetime(row):
    format_string = '%Y-%m-%dT%H:%M:%S.%fZ'
    row.date = datetime.strptime(row.date, format_string)

df = df.map(map_to_datetime)
and then I would apply a filter as
df.filter(lambda row: row.date >= (datetime.today() - timedelta(days=7)))
I managed to get the mapping working, but the filter fails with
TypeError: condition should be string or Column
Is there a way to make the filtering work, or should I change my approach, and if so, how?
The default format of a PySpark date is yyyy-MM-dd. PySpark SQL provides current_date() to return the current date as a date column, date_format() to convert a date/timestamp/string into a string in a given format, and to_date() to convert a column into DateType by casting rules. In other words, to_date() is the function to turn a String (StringType) column into a Date (DateType) column; note that Spark date functions support all Java date formats specified in DateTimeFormatter. You can then combine this with the between() operator to select the range of rows you need.
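A minimal sketch of that to_date() / between() approach, assuming Spark 2.x+ with an existing SparkSession called spark; the sample values and the column names d_str and dt are purely illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date, date_sub, to_date

spark = SparkSession.builder.getOrCreate()

# Sample data with ISO 8601 timestamp strings.
df = spark.createDataFrame(
    [('2015-07-02T11:22:21.050Z', ), ('2016-03-20T21:00:00.000Z', )],
    ['d_str'])

# to_date() turns the StringType column into DateType; ISO 8601 strings
# parse with the default casting rules. For other formats you can pass a
# Java DateTimeFormatter pattern as the second argument (Spark 2.2+).
with_date = df.withColumn('dt', to_date(col('d_str')))

# between() keeps rows whose date falls within the last 7 days (inclusive).
with_date.where(
    col('dt').between(date_sub(current_date(), 7), current_date())
).show()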
Spark >= 1.5
You can use INTERVAL (df_casted here is the DataFrame with the casted dt column built in the Spark < 1.5 section below):

from pyspark.sql.functions import col, current_date, expr

df_casted.where(col("dt") >= current_date() - expr("INTERVAL 7 days"))
Spark < 1.5
You can solve this without using worker-side Python code and without switching to RDDs. First of all, since you use ISO 8601 strings, your data can be cast directly to date or timestamp:
from pyspark.sql.functions import col

df = sc.parallelize([
    ('2015-07-02T11:22:21.050Z', ),
    ('2016-03-20T21:00:00.000Z', )
]).toDF(("d_str", ))

df_casted = df.select(
    "*",
    col("d_str").cast("date").alias("dt"),
    col("d_str").cast("timestamp").alias("ts"))
This will save one roundtrip between the JVM and Python. There are also a few ways you can approach the second part. Date only:
from pyspark.sql.functions import current_date, datediff, unix_timestamp
df_casted.where(datediff(current_date(), col("dt")) < 7)
Timestamps:
def days(i: int) -> int:
    return 60 * 60 * 24 * i

df_casted.where(unix_timestamp() - col("ts").cast("long") < days(7))
You can also take a look at current_timestamp and date_sub.
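For example, a sketch of what those could look like on the casted columns above, assuming Spark >= 1.5 (where current_timestamp and date_sub were introduced):

from pyspark.sql.functions import col, current_date, current_timestamp, date_sub, expr

# date_sub() subtracts a number of days from a date column.
df_casted.where(col("dt") >= date_sub(current_date(), 7))

# current_timestamp() returns the current timestamp; INTERVAL arithmetic
# works on timestamp columns as well.
df_casted.where(col("ts") >= current_timestamp() - expr("INTERVAL 7 days"))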
Note: I would avoid using DataFrame.map. It is better to use DataFrame.rdd.map instead; it will save you some work when switching to 2.0+.
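If you really do need Python-side processing, a minimal sketch of going through the RDD explicitly (the lambda is just an illustration):

# DataFrame.map is gone in PySpark 2.0+, but df.rdd.map keeps working.
dates = df_casted.rdd.map(lambda row: row.dt).collect()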