pyspark's 'between' function is not inclusive for timestamp input.
For example, if we want all rows between two dates, say '2017-04-13' and '2017-04-14', it performs an "exclusive" search when the dates are passed as strings, i.e. it omits rows with '2017-04-14 00:00:00'.
However, the documentation seems to suggest that it is inclusive (no reference to timestamps, though).
Of course, one way is to add a microsecond to the upper bound and pass that to the function, but that's not a great fix. Is there a clean way of doing an inclusive search?
Example:
import pandas as pd
from pyspark.sql import functions as F
... sql_context creation ...
test_pd = pd.DataFrame([{"start": '2017-04-13 12:00:00', "value": 1.0}, {"start": '2017-04-14 00:00:00', "value": 1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
+--------------------+-----+
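For reference, a sketch of the microsecond workaround mentioned above (the bumped upper bound '2017-04-14 00:00:00.000001' is just an illustration, not a recommendation):
# Nudge the upper bound by one microsecond so '2017-04-14 00:00:00' is no longer cut off
test_df.filter(F.col("start").between('2017-04-13', '2017-04-14 00:00:00.000001')).show()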
Found out the answer. pyspark's "between" function is inconsistent in handling timestamp inputs.
For the above example, here is the output for an inclusive search (using pd.to_datetime):
test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
Similarly, if we provide the date AND time in string format, it seems to perform an inclusive search:
test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
.between() method is always inclusive. The problem in your example is that when you pass strings to .between(), it treats your column data as strings as well. For string comparison, '2017-04-14 00:00:00' is strictly greater than '2017-04-14' because the former is a longer string than the latter, which is why the second row is filtered out in your example. To avoid the "inconsistency", pass the arguments to .between() as datetime objects, as follows:
from datetime import datetime as dt

filtered_df = test_df.filter(
    F.col("start").between(
        dt.strptime('2017-04-13 12:00:00', '%Y-%m-%d %H:%M:%S'),
        dt.strptime('2017-04-14 00:00:00', '%Y-%m-%d %H:%M:%S')
    )
)
This will produce the expected result:
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
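An alternative that keeps everything on the Spark side is to cast the string bounds to timestamps explicitly, so .between() compares timestamps rather than strings. A minimal sketch, reusing the test_df from above:
# Cast the string bounds to timestamps inside Spark; between() is then inclusive on timestamps
filtered_df = test_df.filter(
    F.col("start").between(
        F.lit('2017-04-13').cast('timestamp'),   # becomes 2017-04-13 00:00:00
        F.lit('2017-04-14').cast('timestamp')    # becomes 2017-04-14 00:00:00
    )
)
filtered_df.show()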
Just to be clear, if you want to get data for a single date, it's better to specify the exact time bounds.
ex) Retrieve data only on a single day (2017-04-13)
test_df.filter(F.col("start").between('2017-04-13 00:00:00', '2017-04-13 23:59:59.59')).show()
cf) if you set the range as between '2017-04-13' and '2017-04-14', this will also include the 2017-04-14 00:00:00 data, which technically isn't the data you want to pull, since it belongs to 2017-04-14.
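If you'd rather not guess an upper bound like 23:59:59.59, a half-open comparison avoids the problem entirely. A sketch, assuming the same test_df and imports as above:
# start >= 2017-04-13 00:00:00 AND start < 2017-04-14 00:00:00 captures every timestamp on 2017-04-13
single_day = test_df.filter(
    (F.col("start") >= F.lit('2017-04-13').cast('timestamp')) &
    (F.col("start") < F.lit('2017-04-14').cast('timestamp'))
)
single_day.show()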