Resampling in Pandas while keeping value associations

Starting with something like this:

import numpy as np
from pandas import DataFrame
time = np.array(('2015-08-01T07:00:00','2015-08-01T19:00:00'),dtype='datetime64[ns]')
heat_index = np.array([101,103])
air_temperature = np.array([96,95])

df = DataFrame({'heat_index':heat_index,'air_temperature':air_temperature},index=time)

yielding this for df:

                     air_temperature    heat_index
2015-08-01 07:00:00  96                 101
2015-08-01 19:00:00  95                 103

then resample daily:

df_daily = df.resample('24H').max()  # older pandas: df.resample('24H',how='max')

To get this for df_daily:

            air_temperature     heat_index
2015-08-01  96                  103

So with how='max', pandas takes the maximum of each column independently within every 24-hour period.

But as the df output shows for 2015-08-01, that day's maximum heat index (which occurs at 19:00:00) does not line up with the air temperature recorded at the same time. That is, the heat index of 103F occurred with an air temperature of 95F. This association is lost through resampling, and we end up looking at the air temperature from a different part of the day.
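
The mismatch can be checked directly by asking each column where its maximum occurs. This is a minimal sketch, assuming the 07:00 and 19:00 timestamps from the df output above and a current pandas release; `idxmax` returns the index label of each column's maximum.

```python
import numpy as np
import pandas as pd

# Timestamps assumed to match the df output shown above.
time = np.array(['2015-08-01T07:00:00', '2015-08-01T19:00:00'], dtype='datetime64[ns]')
df = pd.DataFrame(
    {'heat_index': np.array([101, 103]), 'air_temperature': np.array([96, 95])},
    index=time,
)

# Each column is reduced on its own, so the two maxima can come from different rows.
when_max = df.idxmax()
print(when_max)
```

If the two timestamps differ, a column-wise max over the day is combining values from different observations.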

Is there a way to resample just one column, and preserve the value in another column at the same index? So that the final outcome would look like this:

            air_temperature     heat_index
2015-08-01  95                  103

My first guess is to just resample the heat_index column...

df_daily = df.resample('24H').agg({'heat_index':'max'})  # older pandas: df.resample('24H',how={'heat_index':'max'})

to get...

            heat_index
2015-08-01  103

...and then use something like DataFrame.loc or DataFrame.ix from there, but I have been unsuccessful. Any thoughts on how to find the related value after resampling (e.g. the air_temperature that occurred at the same time as what is later found to be the maximum heat_index)?

csg2136 asked Aug 12 '15 22:08


1 Answer

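Here's one way. The .groupby(pd.Grouper(freq='24H')) call is essentially what resample is doing (older pandas spelled this pd.TimeGrouper('24H')), and the function passed to apply reduces each group to the row holding the maximum heat_index. In current pandas, apply is used rather than agg because agg hands the function one column at a time.

In [60]: (df.groupby(pd.Grouper(freq='24H'))
            .apply(lambda g: g.loc[g['heat_index'].idxmax(), :]))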

Out[60]: 
            air_temperature  heat_index
2015-08-01               95         103
chrisb answered Nov 03 '22 20:11
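
A variation on the same idea, sketched under the assumption of a current pandas release where `pd.Grouper(freq=...)` is available: take the timestamp of each day's maximum heat_index with `idxmax`, then pull those whole rows out of the original frame with `.loc`. The sample timestamps and the lowercase `'24h'` alias are choices made for this illustration. Unlike the output above, the result keeps the original observation times rather than the bin labels.

```python
import numpy as np
import pandas as pd

# Sample data; timestamps assumed to match the df output in the question.
time = np.array(['2015-08-01T07:00:00', '2015-08-01T19:00:00'], dtype='datetime64[ns]')
df = pd.DataFrame(
    {'heat_index': np.array([101, 103]), 'air_temperature': np.array([96, 95])},
    index=time,
)

# Timestamp of the maximum heat_index within each 24-hour bin.
# Assumption: every bin has at least one observation; empty bins would need
# separate handling before the lookup below.
peak_times = df.groupby(pd.Grouper(freq='24h'))['heat_index'].idxmax()

# Select the full rows at those timestamps, so every column in a row
# comes from the same observation.
df_daily = df.loc[peak_times.to_numpy()]
print(df_daily)
```

If the bin date is wanted as the index instead of the observation time, assigning `df_daily.index = peak_times.index` should relabel the rows, provided each bin contributed exactly one row.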