In pandas 0.14.1, diff() doesn't generate values at the beginning of a time series.
diff() seems to treat missing data differently than cumsum(), which effectively treats NaN as 0. I'm wondering if there is a way to make diff() assume 0 for the previous value when it is missing (missing because it would fall before the beginning of the time series).
For example:
>print(df)
2014-05-01  A  Apple     1
            B  Banana    2
2014-06-01  A  Apple     3
            B  Banana    4
results in:
>print(df.groupby(level=[1,2]).diff())
2014-05-01  A  Apple     NaN
            B  Banana    NaN
2014-06-01  A  Apple     2
            B  Banana    2
When the desired output is:
2014-05-01  A  Apple     1
            B  Banana    2
2014-06-01  A  Apple     2
            B  Banana    2
As far as I know, groupby(...).diff() just calls np.diff, which always returns an array 1 (or n) shorter than what is passed to it.
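To illustrate the length difference, here is a small sketch. The array values are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
values = np.array([1, 3, 6])

# np.diff drops the first position: 3 inputs give 2 differences.
np_result = np.diff(values)

# Series.diff keeps the original length and puts NaN in the first position.
pd_result = pd.Series(values).diff()

print(np_result)           # [2 3]
print(pd_result.tolist())  # [nan, 2.0, 3.0]
```

Either way, the first row has no predecessor to subtract, so nothing meaningful can be computed for it.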
But it should be pretty easy just to fill the missing data. Something like this?
In [175]: df
Out[175]:
                         d
a           b  c
2014-05-01  A  Apple     1
            B  Banana    2
2014-06-01  A  Apple     3
            B  Banana    4
In [176]: df['diff'] = df.groupby(level=[1,2])['d'].diff()
In [177]: df['diff'] = df['diff'].fillna(df['d'])
In [178]: df
Out[178]:
                         d  diff
a           b  c
2014-05-01  A  Apple     1     1
            B  Banana    2     2
2014-06-01  A  Apple     3     2
            B  Banana    4     2
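For anyone who wants to run this end to end, here is a self-contained sketch. The frame is rebuilt from the values shown above, with the level names a, b, c taken from the output. The shift(fill_value=0) variant at the end is an alternative I'm adding, not part of the approach above, and it assumes a pandas version recent enough to support fill_value on groupby shift.

```python
import pandas as pd

# Rebuild the example frame from the values shown in the session above.
index = pd.MultiIndex.from_tuples(
    [
        ("2014-05-01", "A", "Apple"),
        ("2014-05-01", "B", "Banana"),
        ("2014-06-01", "A", "Apple"),
        ("2014-06-01", "B", "Banana"),
    ],
    names=["a", "b", "c"],
)
df = pd.DataFrame({"d": [1, 2, 3, 4]}, index=index)

# Per-group difference; the first row of each group comes back as NaN,
# so fill those positions with the original value (i.e. assume a previous 0).
df["diff"] = df.groupby(level=["b", "c"])["d"].diff().fillna(df["d"])

# Alternative (my assumption, not from the answer): subtract the previous
# value within each group, using 0 where there is no previous value.
alt = df["d"] - df.groupby(level=["b", "c"])["d"].shift(fill_value=0)

print(df)
print(alt)
```

Both give 1, 2, 2, 2 for this data. The fillna version ends up as floats because of the intermediate NaN, while the shift version keeps the integer dtype.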