 

Unexpected number of bins in Pandas DataFrame resample

Problem

I need to reduce the length of a DataFrame to some externally defined integer (could be two rows, 10,000 rows, etc., but will always be a reduction in overall length), but I also want to keep the resulting DataFrame representative of the original. The original DataFrame (we'll call it df) has a datetime column (utc_time) and a data value column (data_value). The datetimes are always sequential, non-repeating, though not evenly spaced (i.e., data might be "missing"). For the DataFrame in this example, the timestamps are at ten minute intervals (when data is present).

Attempts

To accomplish this, my mind immediately went to resampling with the following logic: find the difference in seconds between the first and last timestamps, divide that by the desired final length, and that's the resampling factor. I set this up here:

import numpy as np
import pandas as pd

# Define the desired final length.
final_length = 2
# Define the first timestamp.
first_timestamp = df['utc_time'].min().timestamp()
# Define the last timestamp.
last_timestamp = df['utc_time'].max().timestamp()
# Define the difference in seconds between the first and last timestamps.
delta_t = last_timestamp - first_timestamp
# Define the resampling factor.
resampling_factor = int(np.ceil(delta_t / final_length))

# Set the index from the `utc_time` column so that we can resample nicely.
df.set_index('utc_time', drop=True, inplace=True)
# Do the resampling.
resamp = df.resample(f'{resampling_factor}S')

To look at resamp, I simply looped and printed:

for i in resamp:
    print(i)

This yielded (with some cleanup on my part) the following:

 utc_time                  data_value
 2016-09-28 21:10:00       140.0
 2016-09-28 21:20:00       250.0
 2016-09-28 21:30:00       250.0
 2016-09-28 21:40:00       240.0
 2016-09-28 21:50:00       240.0
 ...                         ...
 2018-08-06 13:00:00       240.0
 2018-08-06 13:10:00       240.0
 2018-08-06 13:20:00       240.0
 2018-08-06 13:30:00       240.0
 2018-08-06 13:40:00       230.0
 
 [69889 rows x 1 columns]

 utc_time                  data_value
 2018-08-06 13:50:00       230.0
 2018-08-06 14:00:00       230.0
 2018-08-06 14:10:00       230.0
 2018-08-06 14:20:00       230.0
 2018-08-06 14:30:00       230.0
 ...                         ...
 2020-06-14 02:50:00       280.0
 2020-06-14 03:00:00       280.0
 2020-06-14 03:10:00       280.0
 2020-06-14 03:20:00       280.0
 2020-06-14 03:30:00       280.0
 
 [97571 rows x 1 columns]

 utc_time                  data_value
 2020-06-14 03:40:00       280.0
 2020-06-14 03:50:00       280.0
 2020-06-14 04:00:00       280.0
 2020-06-14 04:10:00       280.0
 2020-06-14 04:20:00       280.0
 ...                         ...
 2020-06-15 00:10:00       280.0
 2020-06-15 00:20:00       270.0
 2020-06-15 00:30:00       270.0
 2020-06-15 00:40:00       270.0
 2020-06-15 00:50:00       280.0

 [128 rows x 1 columns]

As one can see, this produced three bins rather than the two I expected.
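(As an aside, a quicker way to count the bins than printing each group is `.size()`, which returns one row per bin. A minimal toy frame is used here for illustration, not the real data:)

```python
import pandas as pd

# Toy frame: six rows at ten-minute intervals.
idx = pd.date_range('2020-01-01 00:00', periods=6, freq='10min')
toy = pd.DataFrame({'data_value': range(6)}, index=idx)

sizes = toy.resample('30min').size()
print(len(sizes))      # number of bins: 2
print(sizes.tolist())  # rows per bin: [3, 3]
```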

I could do something different, like changing how I choose the resampling factor (e.g., finding the average time between timestamps and multiplying it by (length of DataFrame / final_length) should yield a more conservative resampling factor), but to my mind that would just mask the underlying issue. Mainly, I'd love to understand why what's happening is happening. Which leads to...

Question

Does anyone know why this is happening, and what steps I might take to ensure I get the desired number of bins? I wonder if it's an offset issue: that is, although the first timestamp in the first bin matches the first timestamp in the DataFrame, perhaps pandas actually starts the bin before then?

For anyone who'd like to play along at home, the test DataFrame can be found here as a .csv. To get it in as a DataFrame:

df = pd.read_csv('test.csv', parse_dates=[0])
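In case the linked CSV isn't available, a synthetic stand-in with the same time span and ten-minute spacing reproduces the binning behavior (the values are arbitrary, and unlike the real data it has no missing intervals):

```python
import numpy as np
import pandas as pd

# Stand-in for the test data: same span, ten-minute steps, arbitrary values.
rng = np.random.default_rng(0)
times = pd.date_range('2016-09-28 21:10:00', '2020-06-15 00:50:00', freq='10min')
df = pd.DataFrame({'utc_time': times,
                   'data_value': rng.integers(10, 29, len(times)) * 10.0})
```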

asked Jul 16 '20 by erekalper



1 Answer

Summary

  • Problem 1 & fix: The way you form the bins produces one extra bin, since bins created with df.resample() are closed on only one end (left or right). Fix this with one of the options listed in "1.".

  • Problem 2 & fix: The left edge of the first bin is at the start of that day ('2016-09-28 00:00:00'), not at the first timestamp (see "2."). You can fix this by passing kind='period' to resample() (see "3.").

1. Having a glance at the input data (& what kind of bins we need)

The input data is from 2016-09-28 21:10:00 to 2020-06-15 00:50:00, and using the resampling_factor you have, we get:

In [63]: df.index.min()
Out[63]: Timestamp('2016-09-28 21:10:00')

In [64]: df.index.min() + pd.Timedelta(f'{resampling_factor}S')
Out[64]: Timestamp('2018-08-07 11:00:00')

In [65]: _ + pd.Timedelta(f'{resampling_factor}S')
Out[65]: Timestamp('2020-06-15 00:50:00')

To partition data into two pieces with these timestamps, we would need bins to be

  • ['2016-09-28 21:10:00', '2018-08-07 11:00:00')
  • ['2018-08-07 11:00:00', '2020-06-15 00:50:00']

(The [ means closed end and ( means open end)

  • Here is one problem: you cannot form bins that are closed at both ends. You have to decide whether the bins are closed on the left or on the right (the closed='left'|'right' argument). With closed='left' you would have
    • ['2016-09-28 21:10:00', '2018-08-07 11:00:00')
    • ['2018-08-07 11:00:00', '2020-06-15 00:50:00')
    • ['2020-06-15 00:50:00', '2022-04-23 14:40:00') (only one entry here)

Possible fixes:

  1. Adjust your last timestamp by adding some time to it:
    last_timestamp = (df['utc_time'].max() +
                      pd.Timedelta('10 minutes')).timestamp()
  2. Make the resampling_factor a bit larger than you first calculated.
  3. Just use the first two dataframes from df.resample and disregard the third, which has only one or a few entries.

Choose which makes most sense in your application.
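A sketch of option 3 on toy data (not the question's data): materialize the groups and keep only the first final_length of them, discarding the sparse tail bin.

```python
import pandas as pd

# Toy frame whose resampling yields a trailing bin with a single row.
idx = pd.date_range('2020-01-01 00:00', periods=7, freq='10min')
toy = pd.DataFrame({'data_value': range(7)}, index=idx)

final_length = 2
groups = [group for _, group in toy.resample('30min')]
print([len(g) for g in groups])  # [3, 3, 1] -- one extra, sparse bin
groups = groups[:final_length]   # keep only the desired number of bins
print([len(g) for g in groups])  # [3, 3]
```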

2. Looking at what we have now

  • From the df.resample docs, we know that the labels returned are the left bin edges
  • If we look at the data, we can see what kind of labels there are now:

In [67]: resamp = df.resample(f'{resampling_factor}S')

In [68]: itr = iter(resamp)

In [69]: next(itr)
Out[69]:
(Timestamp('2016-09-28 00:00:00', freq='58542600S'),
                      data_value
 utc_time
 2016-09-28 21:10:00       140.0
 ...                         ...
 2018-08-06 13:40:00       230.0

 [69889 rows x 1 columns])

In [70]: next(itr)
Out[70]:
(Timestamp('2018-08-06 13:50:00', freq='58542600S'),
                      data_value
 utc_time
 2018-08-06 13:50:00       230.0
 ...                         ...
 2020-06-14 03:30:00       280.0

 [97571 rows x 1 columns])

In [71]: next(itr)
Out[71]:
(Timestamp('2020-06-14 03:40:00', freq='58542600S'),
                      data_value
 utc_time
 2020-06-14 03:40:00       280.0
 ...                         ...
 2020-06-15 00:50:00       280.0

 [128 rows x 1 columns])

  • The bins are therefore
    • ['2016-09-28 00:00:00', '2018-08-06 13:50:00')
    • ['2018-08-06 13:50:00', '2020-06-14 03:40:00')
    • ['2020-06-14 03:40:00', '2022-04-22 17:30:00') (Endpoint calculated by adding resampling_factor to the beginning of the bin.)
  • We see that the first bin does not start from df['utc_time'].min() (2016-09-28 21:10:00), but from the beginning of that day (as you guessed).
  • Since the first bin starts before intended, we have data outside two bins, in a third bin.

3. Fixing the starting bin left edge

The kind argument can be either 'timestamp' or 'period'. If you change it to 'period', you will get the following bins (with closed='left'):

  • ['2016-09-28 21:10:00', '2018-08-07 11:00:00') <-- fixed
  • ['2018-08-07 11:00:00', '2020-06-15 00:50:00')
  • ['2020-06-15 00:50:00', '2022-04-23 14:40:00') (Remove with options given in "1.")
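Putting both fixes together, here is a runnable sketch on a synthetic stand-in for the data (same span, ten-minute steps, placeholder values). It pads the last timestamp per "1.", and anchors the bins at the first timestamp with resample's origin='start' (available since pandas 1.1), which has the same anchoring effect here as kind='period':

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: same span as the question's data, ten-minute steps.
times = pd.date_range('2016-09-28 21:10:00', '2020-06-15 00:50:00', freq='10min')
df = pd.DataFrame({'utc_time': times, 'data_value': 0.0})

final_length = 2
first_timestamp = df['utc_time'].min().timestamp()
# Fix from "1.": pad the last timestamp by one sampling interval so the
# final row falls strictly inside the last (left-closed) bin.
last_timestamp = (df['utc_time'].max() + pd.Timedelta('10 minutes')).timestamp()
resampling_factor = int(np.ceil((last_timestamp - first_timestamp) / final_length))

# Fix from "3.": anchor the first bin at the first timestamp instead of
# letting it snap back to the start of that day.
resamp = df.set_index('utc_time').resample(f'{resampling_factor}s', origin='start')
print(len(resamp.size()))  # 2 bins
```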
answered Nov 10 '22 by np8