 

Unexpected number of bins in Pandas DataFrame resample

Problem

I need to reduce the length of a DataFrame to some externally defined integer (could be two rows, 10,000 rows, etc., but will always be a reduction in overall length), but I also want to keep the resulting DataFrame representative of the original. The original DataFrame (we'll call it df) has a datetime column (utc_time) and a data value column (data_value). The datetimes are always sequential, non-repeating, though not evenly spaced (i.e., data might be "missing"). For the DataFrame in this example, the timestamps are at ten minute intervals (when data is present).

Attempts

To accomplish this, my mind immediately went to resampling with the following logic: find the difference in seconds between the first and last timestamps, divide that by the desired final length, and that's the resampling factor. I set this up here:

import numpy as np
import pandas as pd

# Define the desired final length.
final_length = 2
# Define the first timestamp.
first_timestamp = df['utc_time'].min().timestamp()
# Define the last timestamp.
last_timestamp = df['utc_time'].max().timestamp()
# Define the difference in seconds between the first and last timestamps.
delta_t = last_timestamp - first_timestamp
# Define the resampling factor.
resampling_factor = int(np.ceil(delta_t / final_length))

# Set the index from the `utc_time` column so that we can resample nicely.
df.set_index('utc_time', drop=True, inplace=True)
# Do the resampling.
resamp = df.resample(f'{resampling_factor}S')

To look at resamp, I simply looped and printed:

for i in resamp:
    print(i)

This yielded (with some cleanup on my part) the following:

 utc_time                  data_value
 2016-09-28 21:10:00       140.0
 2016-09-28 21:20:00       250.0
 2016-09-28 21:30:00       250.0
 2016-09-28 21:40:00       240.0
 2016-09-28 21:50:00       240.0
 ...                         ...
 2018-08-06 13:00:00       240.0
 2018-08-06 13:10:00       240.0
 2018-08-06 13:20:00       240.0
 2018-08-06 13:30:00       240.0
 2018-08-06 13:40:00       230.0
 
 [69889 rows x 1 columns]

 utc_time                  data_value
 2018-08-06 13:50:00       230.0
 2018-08-06 14:00:00       230.0
 2018-08-06 14:10:00       230.0
 2018-08-06 14:20:00       230.0
 2018-08-06 14:30:00       230.0
 ...                         ...
 2020-06-14 02:50:00       280.0
 2020-06-14 03:00:00       280.0
 2020-06-14 03:10:00       280.0
 2020-06-14 03:20:00       280.0
 2020-06-14 03:30:00       280.0
 
 [97571 rows x 1 columns]

 utc_time                  data_value
 2020-06-14 03:40:00       280.0
 2020-06-14 03:50:00       280.0
 2020-06-14 04:00:00       280.0
 2020-06-14 04:10:00       280.0
 2020-06-14 04:20:00       280.0
 ...                         ...
 2020-06-15 00:10:00       280.0
 2020-06-15 00:20:00       270.0
 2020-06-15 00:30:00       270.0
 2020-06-15 00:40:00       270.0
 2020-06-15 00:50:00       280.0

 [128 rows x 1 columns]

As one can see, this produced three bins rather than the two I expected.
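(As an aside, a quicker way to count the bins than printing each group is `.size()`, which returns one row per bin. A minimal toy frame is used here for illustration, not the real data:)

```python
import pandas as pd

# Toy frame: six rows at ten-minute intervals.
idx = pd.date_range('2020-01-01 00:00', periods=6, freq='10min')
toy = pd.DataFrame({'data_value': range(6)}, index=idx)

sizes = toy.resample('30min').size()
print(len(sizes))      # number of bins: 2
print(sizes.tolist())  # rows per bin: [3, 3]
```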

I could do something different, like changing how I choose the resampling factor (e.g., finding the average time between timestamps and multiplying it by (length of DataFrame / final_length) should yield a more conservative resampling factor), but to my mind that would just mask the underlying issue. Mainly, I'd love to understand why what's happening is happening. Which leads to...

Question

Does anyone know why this is happening, and what steps I might take to ensure I get the desired number of bins? I wonder if it's an offset issue: that is, although the first timestamp in the first bin matches the first timestamp in the DataFrame, perhaps pandas actually starts the bin before then?

For anyone who'd like to play along at home, the test DataFrame can be found here as a .csv. To get it in as a DataFrame:

df = pd.read_csv('test.csv', parse_dates=[0])
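In case the linked CSV isn't available, a synthetic stand-in with the same time span and ten-minute spacing reproduces the binning behavior (the values are arbitrary, and unlike the real data it has no missing intervals):

```python
import numpy as np
import pandas as pd

# Stand-in for the test data: same span, ten-minute steps, arbitrary values.
rng = np.random.default_rng(0)
times = pd.date_range('2016-09-28 21:10:00', '2020-06-15 00:50:00', freq='10min')
df = pd.DataFrame({'utc_time': times,
                   'data_value': rng.integers(10, 29, len(times)) * 10.0})
```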

asked Jul 16 '20 by erekalper



1 Answer

Summary

  • Problem 1 & fix: The way you form the bins produces one extra bin, since bins created with df.resample() are closed on only one end (left or right). Fix this with one of the options listed in "1.".

  • Problem 2 & fix: The left edge of the first bin is at the start of that day ('2016-09-28 00:00:00'), not at the first timestamp (see "2."). You can fix this by passing kind='period' to resample() (see "3.").

1. Having a glance at the input data (& what kind of bins we need)

The input data is from 2016-09-28 21:10:00 to 2020-06-15 00:50:00, and using the resampling_factor you have, we get:

In [63]: df.index.min()
Out[63]: Timestamp('2016-09-28 21:10:00')

In [64]: df.index.min() + pd.Timedelta(f'{resampling_factor}S')
Out[64]: Timestamp('2018-08-07 11:00:00')

In [65]: _ + pd.Timedelta(f'{resampling_factor}S')
Out[65]: Timestamp('2020-06-15 00:50:00')

To partition data into two pieces with these timestamps, we would need bins to be

  • ['2016-09-28 21:10:00', '2018-08-07 11:00:00')
  • ['2018-08-07 11:00:00', '2020-06-15 00:50:00']

(The [ means closed end and ( means open end)

  • Here is one problem: you cannot form bins that are closed at both ends. You have to decide whether the bins are closed on the left or on the right (the closed='left'|'right' argument). With closed='left' you would have
    • ['2016-09-28 21:10:00', '2018-08-07 11:00:00')
    • ['2018-08-07 11:00:00', '2020-06-15 00:50:00')
    • ['2020-06-15 00:50:00', '2022-04-23 14:40:00') (only one entry here)

Possible fixes:

  1. Adjust your last timestamp by adding some time to it:
    last_timestamp = (df['utc_time'].max() +
                      pd.Timedelta('10 minutes')).timestamp()
  2. Make the resampling_factor a bit larger than you first calculated.
  3. Just use the first two dataframes from df.resample and disregard the third, which has only one or a few entries.

Choose which makes most sense in your application.
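A sketch of option 3 on toy data (not the question's data): materialize the groups and keep only the first final_length of them, discarding the sparse tail bin.

```python
import pandas as pd

# Toy frame whose resampling yields a trailing bin with a single row.
idx = pd.date_range('2020-01-01 00:00', periods=7, freq='10min')
toy = pd.DataFrame({'data_value': range(7)}, index=idx)

final_length = 2
groups = [group for _, group in toy.resample('30min')]
print([len(g) for g in groups])  # [3, 3, 1] -- one extra, sparse bin
groups = groups[:final_length]   # keep only the desired number of bins
print([len(g) for g in groups])  # [3, 3]
```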

2. Looking at what we have now

  • From the df.resample docs, we know that the labels returned are the left bin edges
  • If we look at the data, we can see what kind of labels there are now:

In [67]: resamp = df.resample(f'{resampling_factor}S')

In [68]: itr = iter(resamp)

In [69]: next(itr)
Out[69]:
(Timestamp('2016-09-28 00:00:00', freq='58542600S'),
                      data_value
 utc_time
 2016-09-28 21:10:00       140.0
 ...                         ...
 2018-08-06 13:40:00       230.0

 [69889 rows x 1 columns])

In [70]: next(itr)
Out[70]:
(Timestamp('2018-08-06 13:50:00', freq='58542600S'),
                      data_value
 utc_time
 2018-08-06 13:50:00       230.0
 ...                         ...
 2020-06-14 03:30:00       280.0

 [97571 rows x 1 columns])

In [71]: next(itr)
Out[71]:
(Timestamp('2020-06-14 03:40:00', freq='58542600S'),
                      data_value
 utc_time
 2020-06-14 03:40:00       280.0
 ...                         ...
 2020-06-15 00:50:00       280.0

 [128 rows x 1 columns])

  • The bins are therefore
    • ['2016-09-28 00:00:00', '2018-08-06 13:50:00')
    • ['2018-08-06 13:50:00', '2020-06-14 03:40:00')
    • ['2020-06-14 03:40:00', '2022-04-22 17:30:00') (Endpoint calculated by adding resampling_factor to the beginning of the bin.)
  • We see that the first bin does not start from df['utc_time'].min() (2016-09-28 21:10:00), but from the beginning of that day (as you guessed).
  • Since the first bin starts before intended, we have data outside two bins, in a third bin.

3. Fixing the starting bin left edge

The kind argument can be either 'timestamp' or 'period'. If you change it to 'period', you will get the following bins (with closed='left'):

  • ['2016-09-28 21:10:00', '2018-08-07 11:00:00') <-- fixed
  • ['2018-08-07 11:00:00', '2020-06-15 00:50:00')
  • ['2020-06-15 00:50:00', '2022-04-23 14:40:00') (Remove with options given in "1.")
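Putting both fixes together, here is a runnable sketch on a synthetic stand-in for the data (same span, ten-minute steps, placeholder values). It pads the last timestamp per "1.", and anchors the bins at the first timestamp with resample's origin='start' (available since pandas 1.1), which has the same anchoring effect here as kind='period':

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: same span as the question's data, ten-minute steps.
times = pd.date_range('2016-09-28 21:10:00', '2020-06-15 00:50:00', freq='10min')
df = pd.DataFrame({'utc_time': times, 'data_value': 0.0})

final_length = 2
first_timestamp = df['utc_time'].min().timestamp()
# Fix from "1.": pad the last timestamp by one sampling interval so the
# final row falls strictly inside the last (left-closed) bin.
last_timestamp = (df['utc_time'].max() + pd.Timedelta('10 minutes')).timestamp()
resampling_factor = int(np.ceil((last_timestamp - first_timestamp) / final_length))

# Fix from "3.": anchor the first bin at the first timestamp instead of
# letting it snap back to the start of that day.
resamp = df.set_index('utc_time').resample(f'{resampling_factor}s', origin='start')
print(len(resamp.size()))  # 2 bins
```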
answered Nov 10 '22 by np8