pyspark's 'between' function is not inclusive for timestamp input.
For example, if we want all rows between two dates, say '2017-04-13' and '2017-04-14', it performs an "exclusive" search when the dates are passed as strings, i.e. it omits rows with '2017-04-14 00:00:00'.
However, the documentation seems to suggest that it is inclusive (no reference to timestamps, though).
Of course, one way is to add a microsecond to the upper bound and pass that to the function, but that's not a great fix. Is there a clean way of doing an inclusive search?
Example:
import pandas as pd
from pyspark.sql import functions as F
... sql_context creation ...
test_pd = pd.DataFrame([{"start": '2017-04-13 12:00:00', "value": 1.0}, {"start": '2017-04-14 00:00:00', "value": 1.1}])
test_df = sql_context.createDataFrame(test_pd).withColumn("start", F.col("start").cast('timestamp'))
test_df.show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
test_df.filter(F.col("start").between('2017-04-13','2017-04-14')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
+--------------------+-----+
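For reference, a sketch of the microsecond workaround mentioned above (the bumped upper bound '2017-04-14 00:00:00.000001' is just an illustration, not a recommendation):
# Nudge the upper bound by one microsecond so '2017-04-14 00:00:00' is no longer cut off
test_df.filter(F.col("start").between('2017-04-13', '2017-04-14 00:00:00.000001')).show()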
Found out the answer. pyspark's "between" function is inconsistent in handling timestamp inputs.
For the above example, here is the output for an inclusive search (using pd.to_datetime):
test_df.filter(F.col("start").between(pd.to_datetime('2017-04-13'),pd.to_datetime('2017-04-14'))).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
Similarly, if we provide the date AND time in string format, it seems to perform an inclusive search:
test_df.filter(F.col("start").between('2017-04-13 12:00:00','2017-04-14 00:00:00')).show()
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
.between() method is always inclusive. The problem in your example is that when you pass strings to .between(), it treats your column data as strings as well. For string comparison, '2017-04-14 00:00:00' is strictly greater than '2017-04-14' because the former is a longer string than the latter, which is why the second row is filtered out in your example. To avoid the "inconsistency", pass the arguments to .between() as datetime objects, as follows:
from datetime import datetime as dt

filtered_df = test_df.filter(
    F.col("start").between(
        dt.strptime('2017-04-13 12:00:00', '%Y-%m-%d %H:%M:%S'),
        dt.strptime('2017-04-14 00:00:00', '%Y-%m-%d %H:%M:%S')
    )
)
This will produce the expected result:
+--------------------+-----+
| start|value|
+--------------------+-----+
|2017-04-13 12:00:...| 1.0|
|2017-04-14 00:00:...| 1.1|
+--------------------+-----+
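An alternative that keeps everything on the Spark side is to cast the string bounds to timestamps explicitly, so .between() compares timestamps rather than strings. A minimal sketch, reusing the test_df from above:
# Cast the string bounds to timestamps inside Spark; between() is then inclusive on timestamps
filtered_df = test_df.filter(
    F.col("start").between(
        F.lit('2017-04-13').cast('timestamp'),   # becomes 2017-04-13 00:00:00
        F.lit('2017-04-14').cast('timestamp')    # becomes 2017-04-14 00:00:00
    )
)
filtered_df.show()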
Just to be clear, if you want to get data for a single date, it's better to specify the exact time bounds.
ex) Retrieve data only on a single day (2017-04-13)
test_df.filter(F.col("start").between('2017-04-13 00:00:00', '2017-04-13 23:59:59.59')).show()
cf) if you set the range as between '2017-04-13' and '2017-04-14', this will also include the 2017-04-14 00:00:00 data, which technically isn't the data you want to pull, since it belongs to 2017-04-14.
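If you'd rather not guess an upper bound like 23:59:59.59, a half-open comparison avoids the problem entirely. A sketch, assuming the same test_df and imports as above:
# start >= 2017-04-13 00:00:00 AND start < 2017-04-14 00:00:00 captures every timestamp on 2017-04-13
single_day = test_df.filter(
    (F.col("start") >= F.lit('2017-04-13').cast('timestamp')) &
    (F.col("start") < F.lit('2017-04-14').cast('timestamp'))
)
single_day.show()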