
Divide total sum equally to higher sampled time periods when upsampling with pandas

Tags: python, pandas

I am trying to distribute the total of each time period evenly across the finer-grained periods it contains when upsampling.

What I did:

>>> import pandas
>>> rng = pandas.period_range(start='2014-01-01', periods=2, freq='W')
>>> ts = pandas.Series([i+1 for i in range(len(rng))], index=rng)
>>> ts
2013-12-30/2014-01-05    1
2014-01-06/2014-01-12    2
Freq: W-SUN, dtype: float64

>>> ts.resample('D')
2013-12-30     1
2013-12-31   NaN
2014-01-01   NaN
2014-01-02   NaN
2014-01-03   NaN
2014-01-04   NaN
2014-01-05   NaN
2014-01-06     2
2014-01-07   NaN
2014-01-08   NaN
2014-01-09   NaN
2014-01-10   NaN
2014-01-11   NaN
2014-01-12   NaN
Freq: D, dtype: float64

What I actually want is:

>>> ts.resample('D', some_miracle_thing)
2013-12-30     1/7
2013-12-31     1/7
2014-01-01     1/7
2014-01-02     1/7
2014-01-03     1/7
2014-01-04     1/7
2014-01-05     1/7
2014-01-06     2/7
2014-01-07     2/7
2014-01-08     2/7
2014-01-09     2/7
2014-01-10     2/7
2014-01-11     2/7
2014-01-12     2/7
Freq: D, dtype: float64

Is there a way to do this:

  1. Specifically, e.g., with an x/7 lambda function?
  2. Generically, so it works independently of the factor 7 (say 24 for days to hours and so on)?
Serbitar asked Aug 08 '14 13:08


1 Answer

I hate this solution, but it works for upsampling when you don't know the number of new intervals in advance. Going from week to day is easy: it is always 7 days per week. But I've found that the number of intervals produced by an upsample is usually not known beforehand, and this solution handles that case.
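For the Series with a PeriodIndex from the question, the interval count can also be read off each period directly. The following is a minimal sketch of that idea, which is a different technique from the DataFrame approach in this answer. The function name `distribute_evenly` and the sample data are hypothetical, and it assumes a non-empty Series whose index is a PeriodIndex.

```python
import pandas as pd


def distribute_evenly(ts, freq):
    """Spread each period's value evenly over the sub-periods it contains.

    Hypothetical helper: assumes `ts` is a non-empty Series with a PeriodIndex.
    """
    pieces = []
    for period, value in ts.items():
        # All sub-periods at the finer frequency that fall inside this period.
        sub = pd.period_range(start=period.start_time, end=period.end_time, freq=freq)
        # Each sub-period receives an equal share of the period's total.
        pieces.append(pd.Series(value / len(sub), index=sub))
    return pd.concat(pieces)


# Weekly to daily: every week splits into 7 equal shares.
weekly = pd.Series([1.0, 2.0], index=pd.period_range("2014-01-01", periods=2, freq="W"))
daily = distribute_evenly(weekly, "D")

# Monthly to daily: the divisor follows the month length (31, then 28).
monthly = pd.Series([31.0, 28.0], index=pd.period_range("2014-01", periods=2, freq="M"))
daily_from_months = distribute_evenly(monthly, "D")
```

Because the divisor is the actual number of sub-periods, the same call covers both the fixed factor of 7 and cases where the factor varies, at the cost of a Python-level loop over the original periods.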

The idea is to get the number of post-resample intervals into the initial, pre-resample dataframe, then resample again and divide your data by that interval count. Side note: this is for a DataFrame, not a Series.

# Assumes df has a 'timestamp' column and that all other columns are numeric.
# Create unique group IDs by simply using the existing index (Assumes an integer, non-duplicated index)
df['group'] = df.index

# Count how many post-resample rows each original row (group) is spread over.
df['count'] = df.set_index('timestamp').resample('15min').ffill()['group'].value_counts()

# Resample all data again and forward-fill so that the count is now included in every row.
df = df.set_index('timestamp').resample('15min').ffill()

# Apply the division on the entire dataframe and clean up the helper columns.
df = df.div(df['count'], axis=0).reset_index().drop(['group', 'count'], axis=1)

I'd wrap this in a function and tuck it away so I never have to look at it again, with something like:

def distribute_upsample(df, index, freq): ...

Here index would be 'timestamp' and freq would be '15min'.
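One way such a wrapper might look is sketched below. It follows the steps above, but the helper column name `_group` and the sample data are hypothetical. It assumes the timestamps already lie on the target grid and that every column other than the timestamp is numeric. Note that, as with the steps above, the last original row has no later timestamp to fill towards, so it maps to a single new row and keeps its full value.

```python
import pandas as pd


def distribute_upsample(df, index, freq):
    """Upsample `df` to `freq` and split each original row evenly over its new rows.

    Sketch only: assumes `index` names a datetime column whose values lie on
    the `freq` grid, and that all other columns are numeric.
    """
    # Tag each original row with a unique group ID (hypothetical helper column).
    tagged = df.reset_index(drop=True)
    tagged["_group"] = tagged.index
    # Forward-fill onto the finer grid; new rows inherit their source row's ID.
    filled = tagged.set_index(index).resample(freq).ffill()
    # For every new row, the number of new rows sharing its source row.
    counts = filled["_group"].map(filled["_group"].value_counts())
    # Divide the value columns by that count and drop the helper column.
    return filled.drop(columns="_group").div(counts, axis=0).reset_index()


# Hypothetical sample data: three hourly rows.
hourly = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2022-10-16 00:00", "2022-10-16 01:00", "2022-10-16 02:00"]
    ),
    "value": [4.0, 8.0, 12.0],
})
quarter_hourly = distribute_upsample(hourly, "timestamp", "15min")
```

In this example the first two rows are each split over four 15-minute rows, while the final row stays undivided. If the last interval should be distributed as well, one option is to append a sentinel row at the end of the range before calling the function and drop it afterwards.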

elPastor answered Oct 16 '22 15:10
