
pyspark's "between" function: range search on timestamps is not inclusive

pyspark's 'between' function is not inclusive for timestamp input.

For example, if we want all rows between two dates, say '2017-04-13' and '2017-04-14', it performs an "exclusive" search when the dates are passed as strings, i.e., it omits the rows with '2017-04-14 00:00:00'.

However, the documentation seems to hint that it is inclusive (though it says nothing about timestamps).

Of course, one way is to add a microsecond to the upper bound and pass it to the function. However, that is not a great fix. Is there a clean way of doing an inclusive search?

Example:

import pandas as pd
from pyspark.sql import functions as F
# ... sql_context creation ...
test_pd=pd.DataFrame([{"start":'2017-04-13 12:00:00', "value":1.0},{"start":'2017-04-14 00:00:00', "value":1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
+--------------------+-----+
Vinay Kolar asked Apr 14 '17 01:04



3 Answers

Found out the answer. pyspark's "between" function is inconsistent in handling timestamp inputs.

  1. If you provide the input in string format without a time, it performs an exclusive search (not what we expect from the documentation linked above).
  2. If you provide the input as a datetime object or with an exact time (e.g., '2017-04-14 00:00:00'), it performs an inclusive search.

For the above example, here is the output for an inclusive search (using pd.to_datetime):

test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+

Similarly, if we provide the date AND time in string format, it seems to perform an inclusive search:

test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
Vinay Kolar answered Oct 04 '22 21:10


The .between() method is always inclusive. The problem in your example is that when you pass strings to .between(), it treats your data as strings as well. In a string comparison, '2017-04-14 00:00:00' is strictly greater than '2017-04-14' because the latter is a prefix of the former and the longer string sorts after it; this is why the second row is filtered out in your example. To avoid the "inconsistency", pass the arguments to .between() as datetime objects, as follows:

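from datetime import datetime as dt

# Pass datetime objects so both bounds are compared as timestamps.
filtered_df = (test_df.filter(F.col("start")
                .between(dt.strptime('2017-04-13 12:00:00', '%Y-%m-%d %H:%M:%S'),
                         dt.strptime('2017-04-14 00:00:00', '%Y-%m-%d %H:%M:%S'))))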

This will produce the expected result:

+--------------------+-----+
|               start|value|
+--------------------+-----+
|2017-04-13 12:00:...|  1.0|
|2017-04-14 00:00:...|  1.1|
+--------------------+-----+
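
The string-ordering argument can be checked in plain Python, without Spark. This is only a sketch: it assumes the comparison really is lexicographic, as described above, and it illustrates the ordering rather than calling PySpark. The variable names are made up for the illustration.

```python
from datetime import datetime

lower, upper = "2017-04-13", "2017-04-14"
row = "2017-04-14 00:00:00"

# Lexicographic comparison: `upper` is a prefix of `row`, so the longer
# string sorts after it and the row falls outside the range.
as_strings = lower <= row <= upper

# Comparison on parsed datetimes: midnight equals the upper bound,
# so an inclusive range keeps the row.
as_datetimes = (
    datetime.fromisoformat(lower)
    <= datetime.fromisoformat(row)
    <= datetime.fromisoformat(upper)
)

print(as_strings, as_datetimes)  # False True
```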
Anna K. answered Oct 04 '22 21:10


Just to be clear, if you want to get data for a single date, it's better to specify the exact time.

For example, to retrieve data for a single day only (2017-04-13):

test_df.filter(F.col("start").between('2017-04-13 00:00:00','2017-04-13 23:59:59.59'))

By comparison, if you set the range as between '2017-04-13' and '2017-04-14', it will also include the 2017-04-14 00:00:00 data, which technically isn't what you want to pull out, since it belongs to 2017-04-14.
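
An alternative to guessing the last instant of the day is a half-open range: keep rows at or after midnight of the day and strictly before midnight of the next day. The sketch below only computes the two bounds in plain Python. `day_bounds` is a hypothetical helper name used for illustration, not part of PySpark.

```python
from datetime import datetime, timedelta

def day_bounds(day: str):
    """Return (start, end) datetimes covering `day` as the half-open range [start, end)."""
    start = datetime.strptime(day, "%Y-%m-%d")
    return start, start + timedelta(days=1)

start, end = day_bounds("2017-04-13")
print(start, end)
```

With the test_df from the question, these bounds could be applied as test_df.filter((F.col("start") >= start) & (F.col("start") < end)). That keeps every timestamp on 2017-04-13, including fractions of the last second, and drops 2017-04-14 00:00:00.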

dkdlfls26 answered Oct 04 '22 23:10